ASYMPTOTIC MINIMAXITY OF FALSE DISCOVERY RATE
THRESHOLDING FOR SPARSE EXPONENTIAL DATA

By David Donoho and Jiashun Jin 

Stanford University and Purdue University 

We apply FDR thresholding to a non-Gaussian vector whose
coordinates $X_i$, $i = 1, \ldots, n$, are independent exponential with
individual means $\mu_i$. The vector $\mu = (\mu_i)$ is thought to be sparse, with
most coordinates 1 but a small fraction significantly larger than 1;
roughly, most coordinates are simply 'noise,' but a small fraction
contain 'signal.' We measure risk by per-coordinate mean-squared error
in recovering $\log(\mu_i)$, and study minimax estimation over parameter
spaces defined by constraints on the per-coordinate p-norm of $\log(\mu_i)$:

$\frac{1}{n} \sum_{i=1}^{n} \log^p(\mu_i) \le \eta^p.$

We show for large n and small $\eta$ that FDR thresholding can be
nearly minimax. The FDR control parameter $0 < q < 1$ plays an
important role: when $q \le 1/2$, the FDR estimator is nearly minimax,
while choosing a fixed $q > 1/2$ prevents near minimaxity.

These conclusions mirror those found in the Gaussian case in 
Abramovich et al. [Ann. Statist. 34 (2006) 584-653]. The techniques 
developed here seem applicable to a wide range of other distribu- 
tional assumptions, other loss measures and non-i.i.d. dependency 
structures. 

1. Introduction. Suppose that we have n measurements $X_i$ which are
exponentially distributed, with (possibly different) means $\mu_i$:

(1.1) $X_i \sim \mathrm{Exp}(\mu_i), \quad \mu_i \ge 1, \quad i = 1, \ldots, n.$

The unknown $\mu_i$'s exhibit sparse heterogeneity: most take the common value
1, but a small fraction take different values greater than 1.

There are various ways to precisely define sparsity; see [3], for example. 
In our setting of exponential means, the most intuitive notion of sparsity is
simply that there is a relatively small proportion of $\mu_i$'s which are strictly
larger than 1:

(1.2) $\frac{\#\{i : \mu_i \ne 1\}}{n} \le \varepsilon \approx 0.$

Such situations arise in several areas of application.

• Multiple lifetime analysis. Suppose that the $X_i$ represent failure times
of many comparable independent systems, where a small fraction of the
systems (we do not know which ones) may have significantly higher
expected lifetimes than the typical system.

• Multiple testing. Suppose that we conduct many independent statistical
hypothesis tests, each yielding a p-value $p_i$, say, and that the vast
majority of those tests correspond to cases where the null hypothesis is
true, while a small fraction correspond to cases where a Lehmann
alternative [13] is true. Then $X_i = \log(1/p_i) \sim \mathrm{Exp}(\mu_i)$, where most of the
$\mu_i$ are 1, corresponding to true null hypotheses, while a few are greater
than 1, corresponding to Lehmann alternatives.

• Signal analysis. A common model (e.g., in spread-spectrum
communications) for a discrete-time signal $(Y_t)_{t=1}^{n}$ takes the form
$Y_t = \sum_j W_j \exp\{\sqrt{-1}\,\lambda_j t\} + Z_t$, where $Z_t$ is a white Gaussian noise and the
$\lambda_j$ index a small number of unknown frequencies with white Gaussian noise
coefficients $W_j$. In spectral analysis of such signals, it is common to
compute the periodogram $I(\omega) = |n^{-1/2} \sum_t Y_t \exp(\sqrt{-1}\,\omega t)|^2$ and consider as
primary data the periodogram ordinates $X_i = I(2\pi i/n)$, $i = 1, \ldots, n/2 - 1$.
These can be modeled as independently exponentially distributed with
means $\mu_i$, say; here, most of the $\mu_i = 1$, meaning that there is only noise
at those frequencies, while some of the $\mu_i > 1$, meaning that there is
signal at those frequencies (i.e., certain frequencies $\omega_i = 2\pi i/n$ happen to match
some $\lambda_j$). In an incoherent or noncooperative setting, we would not know
the $\lambda_j$ and, hence, would not know which $\mu_i > 1$.

The simple sparsity model (1.2) is merely a first pass at the problem; in
applications, we may also need to consider situations with a large number
of means which are close to, but not exactly, 1. A more general assumption
(adapted from [3, 7]) is that for some $0 < p < 2$, the log means obey an $\ell^p$
constraint,

$\frac{1}{n} \sum_{i=1}^{n} \log^p(\mu_i) \le \eta^p, \quad \eta \text{ small}, \quad 0 < p < 2.$

Working on the log-scale turns out to be useful because of the 'multiplicative'
nature of the exponential data. The parameter p measures the degree of
sparsity of $\mu$. As $p \to 0$,

$\sum_{i=1}^{n} \log^p(\mu_i) \to \#\{i : \mu_i \ne 1\}.$

1.1. Minimax estimation of sparse exponential means. We now turn to
simultaneous estimation of the means $\mu_i$. Let $\mu = (\mu_1, \mu_2, \ldots, \mu_n)$ and
suppose we use the squared $\ell^2$-norm on the log-scale to measure loss,

$\|\log\hat\mu - \log\mu\|_2^2 = \sum_{i=1}^{n} (\log\hat\mu_i - \log\mu_i)^2.$

Motivated by situations of sparsity, we consider restricted parameter spaces,
namely $\ell^p$-balls with radius $\eta$,

(1.3) $M_{n,p}(\eta) = \left\{ \mu : \frac{1}{n} \sum_{i=1}^{n} \log^p(\mu_i) \le \eta^p \right\}.$

We quantify performance by means of the expected coordinatewise loss

$R_n(\hat\mu, \mu) = \mathcal{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\log\hat\mu_i - \log\mu_i)^2 \right].$

We are interested in the minimax risk, the optimal risk which any estimator
can guarantee to hold uniformly over the parameter space,

(1.4) $R_n^* = R_n^*(M_{n,p}(\eta)) = \inf_{\hat\mu} \sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu, \mu).$

This quantity has been studied before in a related Gaussian noise
setting [3], but not, to our knowledge, in an exponential noise setting. Its
asymptotic behavior as $\eta \to 0$ is pinned down by the following result:

Theorem 1.1.

$\lim_{\eta \to 0} \left[ \lim_{n \to \infty} \frac{R_n^*(M_{n,p}(\eta))}{\eta^p \log^{2-p}\log(1/\eta)} \right] = 1.$

A natural approach in this problem is simple thresholding. More precisely,
set $\hat\mu_t = (\hat\mu_{t,i})_{i=1}^{n}$, where

(1.5) $\hat\mu_{t,i} = \begin{cases} X_i, & X_i \ge t, \\ 1, & \text{otherwise}. \end{cases}$

For an appropriate choice of threshold t (which depends in principle on p
and $\eta$, but not on n), this can be asymptotically minimax, as the following
result shows:

Theorem 1.2.

$\lim_{\eta \to 0} \left[ \inf_t \lim_{n \to \infty} \frac{\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu_t, \mu)}{R_n^*(M_{n,p}(\eta))} \right] = 1.$

Here, by "asymptotically minimax," we mean that the ratio of the worst
risk obtained by the estimator to the corresponding minimax risk tends to
1 as $n \to \infty$, followed by $\eta \to 0$.

The minimizing threshold $t_0 = t_0(p, \eta)$ referred to in this theorem behaves
as

$t_0(p, \eta) \sim p \log(1/\eta) + p \log\log(1/\eta) \cdot (1 + o(1)), \quad \eta \to 0.$

In order to have asymptotic minimaxity, it is important to adapt the
threshold to the sparsity parameters $(p, \eta)$.
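To get a feel for the size of this threshold, the leading terms can be evaluated directly. The sketch below is illustrative only: the function name `threshold_t0` and its `include_sqrt_term` flag are assumptions, the $o(1)$ factor is simply dropped, and the optional square-root term is the one appearing in the explicit choice (3.2) later in the paper.

```python
import math

def threshold_t0(p, eta, include_sqrt_term=False):
    """Leading terms p*log(1/eta) + p*log(log(1/eta)) of the threshold t_0(p, eta).

    With include_sqrt_term=True, the term sqrt(log(log(1/eta))) from (3.2) is added.
    """
    if not (0.0 < eta < math.exp(-1.0)):
        # log(log(1/eta)) is positive only when eta < 1/e.
        raise ValueError("need 0 < eta < 1/e")
    big = math.log(1.0 / eta)
    t = p * big + p * math.log(big)
    if include_sqrt_term:
        t += math.sqrt(math.log(big))
    return t
```

The threshold grows as $\eta$ shrinks: sparser problems call for larger thresholds, and the dependence on p enters through the leading factor.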

1.2. FDR thresholding. FDR-controlling methods were first proposed in
a multiple hypothesis testing situation in [1, 2]. For the exponential model we
are considering, we suppose that there are n independent tests of unrelated
hypotheses, $H_{0,i}$ versus $H_{1,i}$, where the test statistics $X_i$ obey the conditions

(1.6) under $H_{0,i}$: $X_i \sim \mathrm{Exp}(1)$,

(1.7) under $H_{1,i}$: $X_i \sim \mathrm{Exp}(\mu_i)$, $\mu_i > 1$,

and it is unknown how many of the alternative hypotheses are likely to be
true. Select a value q, $0 < q < 1$, which Abramovich et al. [1, 2] called the
FDR control parameter. If we call any case where $H_{0,i}$ is rejected in favor of
$H_{1,i}$ a 'discovery,' then a 'false discovery' is a situation where $H_{0,i}$ is falsely
rejected. An FDR-controlling procedure controls

$\mathcal{E}\left[ \frac{\#\{\text{False Discoveries}\}}{\#\{\text{Total Discoveries}\}} \right] \le q.$

Simes' procedure [17] was shown by [4] to be FDR-controlling and it is easy
to describe. We begin by sorting all of the observations into descending
order,

$X_{(1)} \ge X_{(2)} \ge \cdots \ge X_{(n)}.$

Next, compare the sorted values with quantiles of Exp(1); more specifically,
if $E(t)$ denotes the standard exponential distribution function and $\bar E = 1 - E$
the corresponding survival function, compare $(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ with
$(t_1, t_2, \ldots, t_n)$, where

$t_k = \bar E^{-1}\left(q \cdot \frac{k}{n}\right) = -\log\left(q \cdot \frac{k}{n}\right), \quad 1 \le k \le n,$

and let $t_0 = \infty$. Finally, let $k = k_{FDR}$ be the largest index $k \ge 1$ for which
$X_{(k)} \ge t_k$, with $k = 0$ if there is no such index. The FDR thresholding
estimator $\hat\mu^{FDR}_{q,n}$ uses the (data-dependent) threshold $\hat t^{FDR} = t_{k_{FDR}}$ and has
components $(\hat\mu_i)_{i=1}^{n}$, where

(1.8) $\hat\mu_i = \begin{cases} X_i, & X_i \ge \hat t^{FDR}, \\ 1, & \text{otherwise}. \end{cases}$

In particular, if $k_{FDR} = 0$, then $\hat\mu_i = 1$ for all i. We think of the observations
exceeding $\hat t^{FDR}$ as discoveries; the FDR property guarantees relatively few
false discoveries.
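The procedure just described can be written out in a few lines. The following is a minimal sketch under the conventions stated above (thresholds $t_k = -\log(qk/n)$, largest k with $X_{(k)} \ge t_k$, and $t_0 = \infty$); the function names `fdr_threshold` and `fdr_estimate` are illustrative and not from the paper.

```python
import math

def fdr_threshold(xs, q):
    """Data-dependent FDR threshold t_{k_FDR} for observations xs and 0 < q < 1."""
    n = len(xs)
    ordered = sorted(xs, reverse=True)          # X_(1) >= X_(2) >= ... >= X_(n)
    k_fdr = 0
    for k in range(1, n + 1):
        t_k = -math.log(q * k / n)              # Exp(1) survival quantile at level q*k/n
        if ordered[k - 1] >= t_k:
            k_fdr = k                           # keep the LARGEST index that passes
    if k_fdr == 0:
        return math.inf                         # convention t_0 = infinity
    return -math.log(q * k_fdr / n)

def fdr_estimate(xs, q):
    """FDR thresholding estimator: keep X_i at or above the threshold, else report 1."""
    t_hat = fdr_threshold(xs, q)
    return [x if x >= t_hat else 1.0 for x in xs]
```

Note that the loop does not stop at the first failure: the index $k_{FDR}$ is the largest k for which the comparison succeeds, so a later success overrides earlier failures.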

An attractive property of the procedure is its simplicity and definiteness. 
Another attractive property is its good performance in an estimation con- 
text. Our main result in this paper is the following theorem: 

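Theorem 1.3. 1. When $0 < q \le \frac{1}{2}$, the FDR estimator $\hat\mu^{FDR}_{q,n}$ is
asymptotically minimax, that is,

$\lim_{\eta \to 0} \left[ \lim_{n \to \infty} \frac{\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu^{FDR}_{q,n}, \mu)}{R_n^*(M_{n,p}(\eta))} \right] = 1.$

2. When $q > \frac{1}{2}$, the FDR estimator $\hat\mu^{FDR}_{q,n}$ is not asymptotically minimax,
that is,

$\lim_{\eta \to 0} \left[ \lim_{n \to \infty} \frac{\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu^{FDR}_{q,n}, \mu)}{R_n^*(M_{n,p}(\eta))} \right] > 1.$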



1.3. Interpretation. By controlling the FDR so that there are at least as
many 'true' discoveries above threshold as 'false' ones, we obtain an
estimator that, with increasing sparsity $\eta \to 0$, asymptotically attains the minimax
risk. This is the case across a wide range of measures of sparsity.

The same general conclusion was found in a model of Gaussian
observations due to Abramovich, Benjamini, Donoho and Johnstone [3]. In that
setting, the authors supposed that $X_i \sim N(\mu_i, 1)$ and that the $\mu_i$ are mostly
close to zero so that $\frac{1}{n}\left(\sum_{i=1}^{n} |\mu_i|^p\right) \le \eta_n^p$. (Note that the sparsity parameter
$\eta$ was replaced by a sequence $\eta_n \to 0$ as $n \to \infty$ in [3].) In that setting, it
was shown that FDR thresholding gave asymptotically minimax estimators.
Hence, the results in our paper show that FDR thresholding, known previ- 
ously to be successful in the Gaussian case, is also successful in an interesting 
non-Gaussian case. 

It appears to us that there may be a wide range of non-Gaussian cases 
wherein the vector of means is sparse and FDR gives nearly-minimax results. 
Elsewhere, Jin [12] will report results showing that similar conclusions are
possible in the case of Poisson data. In that setting, we have, for large n, n
Poisson observations $N_i \sim \mathrm{Poisson}(\mu_i)$ with the $\mu_i$ mostly 1 and perhaps a
small fraction significantly greater than 1. In that setting as well, it seems
that FDR thresholding gives near-minimax risk.

In fact, the approach developed here seems applicable to a wide range 
of non-Gaussian distributions and loss functions. At the same time, it also 
seems able to cover a wide range of dependence structures. 

1.4. Contents. The paper is organized as follows. Theorems 1.1 (on min- 
imax risk) and 1.2 (on thresholding risk) are developed and proved in Sec- 
tions 2 and 3, respectively. These sections also introduce a model in which 
the parameter $\mu$ is realized by i.i.d. random sampling rather than as a fixed
vector; this model is very useful for computations. 

Sections 4-7 develop our technical approach to analyzing FDR threshold- 
ing. This starts in Section 4 with a definition and analysis of the so-called 
FDR functional, establishing various boundedness and continuity proper- 
ties. The FDR functional allows us to articulate the idea that in a Bayesian 
setting where both the mean vector $\mu$ and the subordinate data X are
realizations of i.i.d. random variables, there is a 'large-sample threshold' which
FDR thresholding is consistently 'estimating.' Section 5 discusses the per- 
formance of an idealized pseudo-estimator which thresholds at this large- 
sample threshold even in finite samples; it shows that the idealized 'esti- 
mator' achieves risk performance approaching the minimax risk. Section 6 
shows that in large samples, the risk of FDR thresholding is well approxi- 
mated by the risk of idealized FDR thresholding. Section 7 ties together the 
pieces by showing that the results of Sections 4-6 for the Bayesian model 
have close parallels in the original frequentist setting of this introduction, 
implying Theorem 1.3. 

Section 8 ends the paper by (i) graphically illustrating two important 
points about the method and the proof below, (ii) by comparing our results 
to recent work of Genovese and Wasserman and of Abramovich et al. and 
(iii) describing generalizations to a variety of non-Gaussian and dependent 
data structures. 

1.5. Notation. In this paper, we let $E$ denote the cumulative
distribution function (cdf) of Exp(1), while, to avoid confusion, we use $\mathcal{E}$ for the
expectation operator applied to random variables; we also let $\bar E$ denote the
survival function of Exp(1) and extend this notation to all cdf's; that is, for
any cdf G, we let $\bar G = 1 - G$ denote the survival function.

We let $\#$ denote the scale mixture operator, mapping any (marginal)
distribution F on $[1, \infty)$ to a corresponding $G = E\#F$ on $[0, \infty)$, according
to

$G(t) = \int E(t/\mu)\, dF(\mu).$

Note here that G is the cdf of a scalar random variable X, with $\mu$ a random
variable $\mu \sim F$ and $X|\mu \sim \mathrm{Exp}(\mu)$. We let $\mathcal{F}$ denote the set of all eligible
cdf's,

$\mathcal{F} = \{F : P_F\{\mu \ge 1\} = 1\},$

and $\mathcal{F}_p(\eta)$ denote the convex set of pth moment-constrained cdf's,

(1.9) $\mathcal{F}_p(\eta) = \left\{ F \in \mathcal{F} : \int \log^p(\mu)\, dF(\mu) \le \eta^p \right\}, \quad 0 < p < 2.$

We also let $\mathcal{G}$ denote the collection of all scale mixtures of exponentials,

$\mathcal{G} = \{G : G = E\#F,\ F \in \mathcal{F}\},$

and let $\mathcal{G}_p(\eta)$ denote the subclass where the mixing distributions obey the
moment condition $\mathcal{E}[\log^p(\mu)] \le \eta^p$,

(1.10) $\mathcal{G}_p(\eta) = E\#\mathcal{F}_p(\eta) = \{G : G = E\#F,\ F \in \mathcal{F}_p(\eta)\}, \quad 0 < p < 2.$

In this paper, except where we explicitly state otherwise, the cdf's F and G
are always related by scale mixing, so

$G = E\#F.$

(The relation $F \mapsto E\#F$ is one-to-one.) We often use $G$ and $G_n$ together,
always implicitly assuming that they are related as the theoretical and
empirical cdf of the same underlying samples so that $G_n$ is the empirical
distribution for n i.i.d. samples $X_i \sim G$, where

$G_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \le t\}}.$
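For a mixing distribution F with finitely many atoms, both objects are easy to compute. The sketch below is illustrative: it represents F as a list of `(mu, weight)` pairs, which is an assumption of this example rather than notation from the paper, and the function names are hypothetical.

```python
import math

def mixture_survival(t, atoms):
    """Survival function of G = E#F at t, where F puts weight w on mean mu
    for each (mu, w) in atoms; Exp(mu) has survival function exp(-t/mu)."""
    return sum(w * math.exp(-t / mu) for mu, w in atoms)

def empirical_cdf(t, xs):
    """Empirical cdf G_n(t) = (1/n) * #{i : X_i <= t}."""
    return sum(1 for x in xs if x <= t) / len(xs)
```

A mixture that places any weight on means larger than 1 has a heavier tail than Exp(1), which is the feature that FDR thresholding exploits.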



2. Asymptotics of minimax risk. In this section, we prove Theorem 1.1.
As usual, $R_n^*(M) = \sup_{\pi \in \Pi} \rho_n(\pi)$, where $\rho_n(\pi)$ denotes the Bayes risk
$\mathcal{E}_\pi \mathcal{E}_\mu\left[\frac{1}{n}\|\log\hat\mu_\pi - \log\mu\|_2^2\right]$ with $\mu$ random, $\mu \sim \pi$; $\hat\mu_\pi$ denotes the Bayes
estimator corresponding to prior $\pi$ and $\ell^2$ loss and $\Pi$ denotes the set of all
priors supported on M [here, $M = M_{n,p}(\eta)$, as in (1.3)]. Throughout this
paper, we always implicitly assume that $P_{\pi_i}\{\mu_i \ge 1\} = 1$, where $\pi_i$ is the ith
entry of $\pi$.

As in [7], we obtain a simple approximation of $R_n^*$ by considering a
minimax-Bayes problem in which $\mu$ is a random vector that is only required
to belong to M on average. We define the minimax-Bayes risk as follows:

(2.1) $\bar R_n^*(M_{n,p}(\eta)) = \inf_{\hat\mu} \sup_{\pi} \left\{ \mathcal{E}_\pi \mathcal{E}_\mu\left[\frac{1}{n}\|\log\hat\mu - \log\mu\|_2^2\right] : \mathcal{E}_\pi\left[\frac{1}{n}\sum_{i=1}^{n} \log^p \mu_i\right] \le \eta^p \right\}.$

Since a degenerate prior distribution concentrated at a single point $\mu \in
M_{n,p}(\eta)$ trivially satisfies the moment constraint, the minimax-Bayes risk is
an upper bound for the minimax risk, that is,

(2.2) $R_n^*(M_{n,p}(\eta)) \le \bar R_n^*(M_{n,p}(\eta)).$

In fact, for large n, we have asymptotic equality; in Section 2.1 we will prove
the following:

Theorem 2.1.

$\lim_{n \to \infty} \frac{R_n^*(M_{n,p}(\eta))}{\bar R_n^*(M_{n,p}(\eta))} = 1.$

Consider a univariate decision problem with data X a scalar random
variable, with $\mu$ a random scalar satisfying $\mu \sim F$ and $X|\mu \sim \mathrm{Exp}(\mu)$. The
corresponding univariate minimax-Bayes risk is

(2.3) $\rho(\eta) = \rho_p(\eta) = \inf_{\delta} \sup_{F \in \mathcal{F}_p(\eta)} \mathcal{E}_F \mathcal{E}_\mu (\log\delta(X) - \log\mu)^2.$

The univariate and n-variate minimax risks are closely connected; in
Section 2.2, we will prove the following:

Theorem 2.2. $\bar R_n^*(M_{n,p}(\eta)) = \rho_p(\eta)$, where $\bar R_n^*$ is the minimax-Bayes risk (2.1).

The univariate minimax-Bayes risk has a simple asymptotic expression as
given by the following result:

Theorem 2.3. For $0 < p < 2$,

$\lim_{\eta \to 0} \frac{\rho_p(\eta)}{\eta^p \log^{2-p}\log(1/\eta)} = 1.$

Theorem 1.1 follows immediately by combining Theorems 2.1-2.3.

2.1. Proof of Theorem 2.1. Because (2.2) gives half of what we need, our 
task is to establish an asymptotic inequality in the other direction. We use 
a strategy similar to that of [7]. 

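Now, for fixed $\eta$, choose $0 < \zeta \ll \eta$ and construct the product distribution
$\pi^n_{\eta-\zeta} = \prod_{i=1}^{n} \pi^*_{\eta-\zeta}$, where $\mu_i \sim \pi^*_{\eta-\zeta}$, $\int \log^p(\mu)\, d\pi^*_{\eta-\zeta} = (\eta-\zeta)^p$, $1 \le i \le n$, and
$\pi^*_{\eta-\zeta}$ is least favorable for the univariate Bayes minimax problem (2.3), so $\pi^n_{\eta-\zeta}$
is least favorable for the n-variate Bayes minimax problem (2.1). Let $A_n =
\{\frac{1}{n}\sum_{i=1}^{n} \log^p \mu_i \le \eta^p\}$. We then construct a new prior, $\tilde\pi^n_{\eta-\zeta} = \pi^n_{\eta-\zeta}(\,\cdot\,|A_n)$.
By the law of large numbers (LLN),

(2.4) $P(A_n) \to 1,$

while under $\tilde\pi^n_{\eta-\zeta}$, we have $\mu \in M_{n,p}(\eta)$, that is, $\mathrm{supp}\,\tilde\pi^n_{\eta-\zeta} \subset M_{n,p}(\eta)$. As
the minimax risk is the supremum of Bayes risks, we have

(2.5) $R_n^* \ge \rho_n(\tilde\pi^n_{\eta-\zeta}).$

Now, for any constant $w > 1$, and with $L(\cdot, \cdot)$ the loss function

$L(\hat\mu, \mu) = \frac{1}{n} \sum_{i=1}^{n} (\log\hat\mu_i - \log\mu_i)^2,$

define the w-truncated loss function,

$L^{(w)}(\hat\mu, \mu) = \frac{1}{n} \sum_{i=1}^{n} \min\{(\log\hat\mu_i - \log\mu_i)^2, w\}.$

Clearly,

(2.6) $\rho_n(\tilde\pi^n_{\eta-\zeta}, L) \ge \rho_n(\tilde\pi^n_{\eta-\zeta}, L^{(w)}),$

where $\rho_n(\pi, L)$ denotes the Bayes risk with respect to the loss function L.
With $\|\cdot\|_{TV}$ denoting the variation distance, the definition of $\tilde\pi^n_{\eta-\zeta}$ and (2.4)
give

$\|\pi^n_{\eta-\zeta} - \tilde\pi^n_{\eta-\zeta}\|_{TV} \le 1 - P(A_n) \to 0.$

For variation distance, $|\mathcal{E}_P f - \mathcal{E}_Q f| \le \|f\|_\infty \cdot \|P - Q\|_{TV}$; thus, for any fixed
w, the Bayes risks satisfy

$|\rho_n(\pi^n_{\eta-\zeta}, L^{(w)}) - \rho_n(\tilde\pi^n_{\eta-\zeta}, L^{(w)})| \le w \cdot (1 - P(A_n)) \to 0$ as $n \to \infty$.

On the other hand, for $L$ or $L^{(w)}$, the coordinatewise separability of the
loss and the independence of the coordinates ensure that the per-coordinate
Bayes risk does not depend on the number of coordinates, that is,

$\rho_n(\pi^n_{\eta-\zeta}, L) = \rho_1(\pi^*_{\eta-\zeta}, L), \quad \rho_n(\pi^n_{\eta-\zeta}, L^{(w)}) = \rho_1(\pi^*_{\eta-\zeta}, L^{(w)}).$

We conclude that for each $w > 0$,

$\rho_n(\tilde\pi^n_{\eta-\zeta}, L^{(w)}) \to \rho_1(\pi^*_{\eta-\zeta}, L^{(w)})$ as $n \to \infty$.

Using monotone convergence of $L^{(w)} \to L$ as $w \to \infty$, we have

$\rho_1(\pi^*_{\eta-\zeta}, L^{(w)}) \to \rho_1(\pi^*_{\eta-\zeta}, L) = \rho(\eta - \zeta),$

so from (2.5)-(2.6),

$\liminf_{n \to \infty} R_n^* \ge \rho(\eta - \zeta).$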

Now, $\rho(\eta)$ is monotone and continuous as a function of $\eta$; thus, by letting
$\zeta \to 0$, we have

$\liminf_{n \to \infty} R_n^* \ge \rho(\eta) = \bar R_n^*,$

where the last equality is Theorem 2.2.

2.2. Proof of Theorem 2.2. First, observe that by the
coordinatewise-separable nature of any estimator $\delta = \delta_n$ for $\mu$ and the i.i.d. structure of the
$X_i/\mu_i$,

(2.7) $\frac{1}{n} \mathcal{E}_\pi \mathcal{E}_\mu \|\log\delta_n - \log\mu\|_2^2 = \frac{1}{n} \sum_{i} \int \mathcal{E}_{\mu_i}[\log\delta(X_i) - \log\mu_i]^2\, \pi_i(d\mu_i)$

(2.8) $= \int \mathcal{E}_{\mu_1}[\log\delta(X_1) - \log\mu_1]^2 \left(\frac{1}{n}\sum_{i} \pi_i\right)(d\mu_1)$

(2.9) $= \mathcal{E}_{F_\pi} \mathcal{E}_{\mu_1}[\log\delta(X_1) - \log\mu_1]^2,$

where $F_\pi = \frac{1}{n}\sum_{i} \pi_i(d\mu_i)$ is a univariate prior. Second, observe that the
moment condition on $\pi$ can also be expressed in terms of $F_\pi$ since

(2.10) $\mathcal{E}_\pi\left[\frac{1}{n}\sum_{i} \log^p \mu_i\right] = \frac{1}{n}\sum_{i} \int \log^p(\mu_i)\, \pi_i(d\mu_i) = \int \log^p(\mu_1)\, F_\pi(d\mu_1),$

thus, $\mathcal{E}_{F_\pi} \log^p \mu_1 \le \eta^p$. Theorem 2.2 is easily derived from (2.7)-(2.10).
Indeed, let $(F^0, \delta^0)$ be a saddlepoint for the univariate problem (2.3), that is,
$\delta^0$ is a minimax rule, $F^0$ is a least favorable prior distribution and $\delta^0$ is
Bayes for $F^0$. Let $F^{0,n}$ denote the n-fold Cartesian product measure derived
from $F^0$ and $\delta^{0,n}$ the n-fold Cartesian product of $\delta^0$. From (2.10) and (2.7),
$F^{0,n}$ satisfies the moment constraint for $\bar R_n^*(M_{n,p}(\eta))$ and

$\frac{1}{n} \mathcal{E}_{F^{0,n}} \mathcal{E}_\mu \|\log\delta^{0,n} - \log\mu\|_2^2 = \rho_p(\eta).$

To establish the theorem, it is enough to verify that $(F^{0,n}, \delta^{0,n})$ is a
saddlepoint for the minimax problem $\bar R_n^*(M_{n,p}(\eta))$. This would follow if for every
$\pi$ obeying the moment constraint for $\bar R_n^*(M_{n,p}(\eta))$,

$\mathcal{E}_\pi \mathcal{E}_\mu \|\log\delta^{0,n} - \log\mu\|_2^2 \le \mathcal{E}_{F^{0,n}} \mathcal{E}_\mu \|\log\delta^{0,n} - \log\mu\|_2^2.$

But (2.7)-(2.10) reduce this to the saddlepoint property of $(F^0, \delta^0)$ in the
1-dimensional minimax problem $\rho_p(\eta)$.

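2.3. Proof of Theorem 2.3. The following is proved in [11], Chapter 6:

Lemma 2.1. For functions $a = a(\eta)$ and $d = d(\eta)$ such that $\lim_{\eta \to 0} a(\eta) =
0$, $\lim_{\eta \to 0} d(\eta) = \infty$ and $\lim_{\eta \to 0} [a(\eta)/d(\eta)]^{1/(d(\eta)-1)} = 0$,

$\int_0^1 [(a/d) + y^{1 - 1/d}]^{-1}\, dy = d \cdot (1 + O((a/d)^{1/(d-1)}))$ as $\eta \to 0$.

We now describe lower and upper bounds for $\rho(\eta)$, both equivalent to
$\eta^p \log^{2-p}(\log\frac{1}{\eta})$ asymptotically as $\eta \to 0$. First, consider a lower bound for
$\rho(\eta)$. A natural lower bound uses 2-point priors,

(2.11) $\rho(\eta) = \sup_{F \in \mathcal{F}_p(\eta)} \rho_1(F) \ge \sup_{F_{\varepsilon,\mu} \in \mathcal{F}_p(\eta)} \rho_1(F_{\varepsilon,\mu}),$

where $F_{\varepsilon,\mu} = (1 - \varepsilon)\nu_1 + \varepsilon\nu_\mu \in \mathcal{F}_p(\eta)$ denotes the mixture of point
masses at 1 and $\mu$ with fractions $(1 - \varepsilon)$ and $\varepsilon$, respectively. The Bayes rule
$\delta_B(X; F_{\varepsilon,\mu})$ obeys

(2.12) $\log(\delta_B(X; F_{\varepsilon,\mu})) = \frac{\frac{\varepsilon}{\mu} e^{-X/\mu}}{(1-\varepsilon)e^{-X} + \frac{\varepsilon}{\mu} e^{-X/\mu}} \log\mu,$

and the Bayes risk is

$\rho_1(F_{\varepsilon,\mu}) = (\log\mu)^2 \int_0^\infty \frac{(1-\varepsilon)e^{-x} \cdot \frac{\varepsilon}{\mu} e^{-x/\mu}}{(1-\varepsilon)e^{-x} + \frac{\varepsilon}{\mu} e^{-x/\mu}}\, dx$

(2.13) $= \frac{\varepsilon \log^2(\mu)}{\mu} \int_0^1 \left[\frac{\varepsilon}{(1-\varepsilon)\mu} + y^{1 - 1/\mu}\right]^{-1} dy.$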

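In particular, if we let $\mu^* = \mu^*(\eta) = \log(\frac{1}{\eta}) / (\log\log\frac{1}{\eta})$ and $\varepsilon^* = \varepsilon^*(\eta) = \eta^p /
\log^p(\mu^*)$, then applying Lemma 2.1 with $a = a(\eta) = \varepsilon^*/(1 - \varepsilon^*)$ and $d =
d(\eta) = \mu^*$, we have

$\rho_1(F_{\varepsilon^*(\eta),\mu^*(\eta)}) = \left(\eta^p \log^{2-p}\log\frac{1}{\eta}\right) \cdot (1 + o(1))$

and obtain the desired lower bound

(2.14) $\rho(\eta) \ge \rho_1(F_{\varepsilon^*(\eta),\mu^*(\eta)}) = \left(\eta^p \log^{2-p}\log\frac{1}{\eta}\right) \cdot (1 + o(1)).$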
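The two-point Bayes rule and its risk are simple enough to evaluate numerically. The sketch below is illustrative and not from the paper: the function names are assumptions, the Bayes rule is the posterior mean of $\log\mu$ under the prior with mass $1 - \varepsilon$ at 1 and $\varepsilon$ at $\mu$, and the risk is computed from the integral over $y \in (0, 1)$ obtained by substituting $y = e^{-x}$ in the Bayes risk, using a plain midpoint rule.

```python
import math

def bayes_log_estimate(x, eps, mu):
    """Posterior mean of log(mu) given X = x under the two-point prior."""
    alt = eps * math.exp(-x / mu) / mu       # eps * (density of Exp(mu) at x)
    null = (1.0 - eps) * math.exp(-x)        # (1 - eps) * (density of Exp(1) at x)
    return alt / (null + alt) * math.log(mu)

def bayes_risk_two_point(eps, mu, steps=200_000):
    """Bayes risk (eps*log(mu)^2/mu) * int_0^1 [eps/((1-eps)mu) + y^(1-1/mu)]^(-1) dy,
    approximated with the midpoint rule (requires 0 < eps < 1 and mu > 1)."""
    a_over_d = eps / ((1.0 - eps) * mu)
    power = 1.0 - 1.0 / mu
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += 1.0 / (a_over_d + y ** power)
    return eps * math.log(mu) ** 2 / mu * total * h
```

The estimate increases with x, since large observations are more likely under the alternative, and the Bayes risk can never exceed the prior variance $\varepsilon(1-\varepsilon)\log^2\mu$ of $\log\mu$.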

We obtain an upper bound by considering the risk of thresholding. Define
the univariate thresholding nonlinearity

(2.15) $\delta_t(x) = \begin{cases} x, & x \ge t, \\ 1, & \text{otherwise}. \end{cases}$

Then with thresholding estimator $\delta_t(X)$ based on scalar data X obeying
$X|\mu \sim \mathrm{Exp}(\mu)$, where the scalar $\mu$ is distributed according to a prior $F \in
\mathcal{F}_p(\eta)$, the univariate Bayes thresholding risk is

$\rho_T(t, F) = \mathcal{E}(\log(\delta_t(X)) - \log(\mu))^2.$

We are particularly interested in the specific threshold

$t_0 = t_0(p, \eta) = p\log\left(\frac{1}{\eta}\right) + p\log\log\left(\frac{1}{\eta}\right) + \sqrt{\log\log\left(\frac{1}{\eta}\right)}.$

The worst-case univariate Bayes risk for this rule is

(2.16) $\rho_T(t_0, \eta) = \rho_T(t_0, \eta; p) = \sup_{F \in \mathcal{F}_p(\eta)} \rho_T(t_0, F).$

As the minimax rule is at least as good as any specific rule, we have

(2.17) $\rho(\eta) \le \rho_T(t_0, \eta).$

Now, in the proof of Theorem 1.2 below, we show that the thresholding risk
obeys

(2.18) $\rho_T(t_0, \eta; p) \le \eta^p \log^{2-p}\log\frac{1}{\eta}\,(1 + o(1))$ as $\eta \to 0$.

Combining the lower bound given by (2.14) and the upper bounds given
by (2.17)-(2.18), we obtain Theorem 2.3.

3. Asymptotic minimaxity of thresholding. We now prove Theorem 1.2, 
showing that thresholding estimates can asymptotically approach the mini- 
max risk. 

3.1. Reduction to univariate thresholding. In effect, we need only prove
(2.18). We first recall why this establishes Theorem 1.2. Again, let $\hat\mu_t$ denote
the thresholding procedure on samples of size n. Trivially, for any t and n,
the risk of thresholding at t exceeds the minimax risk:

$\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu_t, \mu) \ge R_n^*(M_{n,p}(\eta)).$

Theorem 1.2 thus follows from an asymptotic inequality in the other
direction,

(3.1) $\limsup_{\eta \to 0} \left[ \inf_t \limsup_{n \to \infty} \frac{\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu_t, \mu)}{R_n^*(M_{n,p}(\eta))} \right] \le 1.$

If we take

(3.2) $t_0 = t_0(p, \eta) = p\log(1/\eta) + p\log\log(1/\eta) + \sqrt{\log\log(1/\eta)},$

then by Theorem 2.1 and Theorem 2.2, (3.1) reduces to

(3.3) $\limsup_{\eta \to 0} \left[ \limsup_{n \to \infty} \frac{\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu_{t_0}, \mu)}{\rho(\eta)} \right] \le 1.$



Consider the worst Bayes risk of $\hat\mu_{t_0}$ with respect to any prior $\mu \sim \pi$,
where $\pi$ is the distribution of a random vector which is only required to
belong to $M_{n,p}$ on average,

(3.4) $\bar R_n(\hat\mu_{t_0}, \eta) = \bar R_n(\hat\mu_{t_0}, \eta, p) = \sup_{\pi} \left\{ \mathcal{E}_\pi \mathcal{E}_\mu\left[\frac{1}{n}\|\log\hat\mu_{t_0} - \log\mu\|_2^2\right] \text{ for } \pi : \mathcal{E}_\pi\left[\frac{1}{n}\sum_{i=1}^{n} \log^p \mu_i\right] \le \eta^p \right\}.$

Now, since degenerate prior distributions concentrated at points $\mu \in M_{n,p}(\eta)$
trivially satisfy the moment constraint in (3.4), we have

(3.5) $\sup_{\mu \in M_{n,p}(\eta)} R_n(\hat\mu_{t_0}, \mu) \le \bar R_n(\hat\mu_{t_0}, \eta).$

Consider also the worst univariate Bayes risk (2.16) of the scalar rule $\delta_{t_0}(X)$,
as in (2.15), with respect to univariate prior $F \in \mathcal{F}_p(\eta)$. As in the proof of
Theorem 2.2, it is not hard to show that the worst multivariate Bayes
risk is the same as the worst univariate Bayes risk,

(3.6) $\bar R_n(\hat\mu_{t_0}, \eta) = \rho_T(t_0, \eta).$

Hence, we now see that given (2.14), the matching upper bound (2.18)
implies that

(3.7) $\lim_{\eta \to 0} \frac{\rho_T(t_0, \eta)}{\rho(\eta)} = 1.$

Combining (3.5)-(3.7) yields (3.3) and Theorem 1.2. We thus turn to (2.18).

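The univariate Bayes risk for thresholding at t can be decomposed into a
bias proxy and a variance proxy as follows:

$\rho_T(t, F) = \int (\log\mu)^2 (1 - e^{-t/\mu})\, dF(\mu) + \int \left[\int_{t/\mu}^{\infty} \log^2(x) e^{-x}\, dx\right] dF(\mu)$

$\equiv \int b(t, \mu)\, dF(\mu) + \int v(t, \mu)\, dF(\mu),$

say. We now proceed to show that as $\eta \to 0$,

(3.8) $\sup_{F \in \mathcal{F}_p(\eta)} \int b(t_0, \mu)\, dF(\mu) \le \eta^p \log^{2-p}\log\frac{1}{\eta} \cdot (1 + o(1))$

and

(3.9) $\sup_{F \in \mathcal{F}_p(\eta)} \int v(t_0, \mu)\, dF(\mu) = o\left(\eta^p \log^{2-p}\log\frac{1}{\eta}\right).$

Together, these imply (2.18).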
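The two proxies can be evaluated numerically, which is a convenient way to see how the threshold trades one against the other. The sketch below is illustrative and not from the paper: the bias proxy is taken as $b(t,\mu) = \log^2(\mu)(1 - e^{-t/\mu})$ and the variance proxy as $v(t,\mu) = \int_{t/\mu}^{\infty} \log^2(x)e^{-x}\,dx$, the function names are assumptions, and the integral is computed by the trapezoid rule after the substitution $u = \log x$ (an implementation choice that removes the logarithmic singularity at 0).

```python
import math

def bias_proxy(t, mu):
    """b(t, mu): squared log-mean times the probability that X falls below t."""
    return math.log(mu) ** 2 * (1.0 - math.exp(-t / mu))

def variance_proxy(t, mu, steps=20_000):
    """v(t, mu): integral of log(x)^2 * exp(-x) over x >= t/mu (needs t > 0)."""
    lo = math.log(t / mu)
    hi = max(5.0, lo + 1.0)          # the integrand is negligible beyond this point
    h = (hi - lo) / steps

    def f(u):
        # Integrand after substituting x = exp(u): u^2 * exp(u) * exp(-exp(u)).
        return u * u * math.exp(u - math.exp(u))

    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

def threshold_risk_two_point(t, eps, mu):
    """Thresholding risk for the prior with mass 1 - eps at 1 and eps at mu."""
    return ((1.0 - eps) * (bias_proxy(t, 1.0) + variance_proxy(t, 1.0))
            + eps * (bias_proxy(t, mu) + variance_proxy(t, mu)))
```

Raising t shrinks the variance proxy at $\mu = 1$ (fewer pure-noise coordinates survive) but increases the bias proxy at $\mu > 1$ (more signal coordinates are set to 1).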

3.2. Maximizing linear functionals over $\mathcal{F}_p(\eta)$. The relations (3.8)-(3.9)
concern maximization of functionals over cdf's of moment-constrained scale
mixtures. We now approach this problem from a general viewpoint, looking
ahead to maximization problems in later sections.

Consider two functions $\psi(\mu), \phi(\mu)$ in $C[1, \infty) \cap C^2(1, \infty)$. Suppose

(a) $\phi$ is strictly increasing and $\phi(1) = 0$;

(b) $\psi$ is bounded, $\psi(1) = 0$, $\psi \ge 0$ but $\psi$ is not identically 0;

(c) $\lim_{\mu \to \infty} [\psi(\mu)/\phi(\mu)] = 0$.

Fig. 1. Generalized convex envelope $\Psi(z)$ for the case $\lim_{\mu \to 1+}[\psi(\mu)/\phi(\mu)] < \infty$ in the
$\phi$-$\psi$ plane. In this example shown here with $\lim_{\mu \to 1+}[\psi(\mu)/\phi(\mu)] = 0$, the thinner curve is
$\{(\phi(\mu), \psi(\mu)) : \mu \ge 1\}$. When $0 < z \le \phi(\mu^*)$, $\Psi(z)$ is a linear function of z and is illustrated
by the line segment. The case $z > \phi(\mu^*)$ is not discussed.

We are interested in the following maximization problem:

(3.10) $\Psi(z) = \sup\left\{ \int \psi(\mu)\, dF(\mu) : \int \phi(\mu)\, dF(\mu) \le z \right\}.$

In the case $\phi(\mu) = \mu$, $\Psi(z)$ is the usual convex envelope of $\psi$, that is, $\Psi(z)$
traces out the least concave majorant of the graph of $\psi$. The next two
lemmas describe the computation of the envelope.

Lemma 3.1. Suppose $\lim_{\mu \to 1+}[\psi(\mu)/\phi(\mu)]$ exists and the limit is strictly
smaller than $\Psi^* = \sup_{\mu > 1}\{\psi(\mu)/\phi(\mu)\}$. Set

$\mu^* = \mu^*(\psi, \phi) = \max\{\mu > 1 : \psi(\mu)/\phi(\mu) = \Psi^*\}.$

Then for any $0 < z \le \phi(\mu^*)$, $\Psi(z) = \Psi^* \cdot z$ and is attained by the mixture
of point masses at 1 and $\mu^*$ with masses $(1 - \varepsilon(z))$ and $\varepsilon(z)$, respectively,
where $\varepsilon(z) = \varepsilon(z; \psi, \phi) = z/\phi(\mu^*)$.

See Figure 1.

Lemma 3.2. Suppose that $\lim_{\mu \to 1+}[\psi(\mu)/\phi(\mu)] = \infty$ and suppose there
exists $\bar\mu = \bar\mu(\psi, \phi) > 1$ so that $(\psi'(\mu)/\phi'(\mu))$ is strictly decreasing in the
interval $(1, \bar\mu]$ and, finally, that $\psi'(\bar\mu)/\phi'(\bar\mu) < \Psi^{**}(\bar\mu)$, where

(3.11) $\Psi^{**}(\mu) = \Psi^{**}(\mu; \bar\mu, \phi, \psi) = \sup_{\mu' \ge \bar\mu} \left\{ \frac{\psi(\mu') - \psi(\mu)}{\phi(\mu') - \phi(\mu)} \right\}, \quad 1 < \mu \le \bar\mu.$

Then there is a unique solution $\mu_* = \mu_*(\psi, \phi)$ to the equation

$\Psi^{**}(\mu) = \psi'(\mu)/\phi'(\mu), \quad 1 < \mu < \bar\mu;$

moreover, letting

$\mu^* = \max\left\{ \mu \ge \bar\mu : \frac{\psi(\mu) - \psi(\mu_*)}{\phi(\mu) - \phi(\mu_*)} = \Psi^{**}(\mu_*) \right\},$

then when $0 < z \le \phi(\mu_*)$, $\Psi(z) = \psi(\phi^{-1}(z))$ and is attained by the
single point mass $\nu_{\mu_z}$ with $\mu_z = \phi^{-1}(z)$ and when $\phi(\mu_*) < z \le \phi(\mu^*)$, $\Psi(z) =
\psi(\mu_*) + \Psi^{**}(\mu_*)[z - \phi(\mu_*)]$ and is attained by the mixture of point masses
at $\mu_*$ and $\mu^*$ with masses $(1 - \varepsilon(z))$ and $\varepsilon(z)$, respectively, where $\varepsilon(z) =
[z - \phi(\mu_*)]/[\phi(\mu^*) - \phi(\mu_*)]$.

Notice here that the strict monotonicity of $\psi'(\mu)/\phi'(\mu)$ over $(1, \bar\mu]$ is
equivalent to concavity of the curve $\{(\phi(\mu), \psi(\mu)) : 1 < \mu \le \bar\mu\}$ in the $(\phi(\mu), \psi(\mu))$
plane. See Figure 2.

The proofs of Lemmas 3.1 and 3.2 can be found in the full version of this 
paper [6]. 
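In the situation of Lemma 3.1 the envelope is linear near the origin, with slope $\sup_{\mu>1}\psi(\mu)/\phi(\mu)$, and this slope is easy to approximate on a grid. The sketch below is illustrative only: the helper `envelope_slope`, the grid, and the particular choices $t_0 = 10$, $p = 1$ with $\psi(\mu) = \log^2(\mu)(1 - e^{-t_0/\mu})$ and $\phi(\mu) = \log^p(\mu)$ are assumptions made for the example.

```python
import math

def envelope_slope(psi, phi, grid):
    """Grid approximation to the slope sup psi/phi and to a maximizing point."""
    # max() returns the first maximizer on the grid; the lemma takes the largest
    # one, which only matters if the maximum is attained at several points.
    mu_star = max(grid, key=lambda m: psi(m) / phi(m))
    return psi(mu_star) / phi(mu_star), mu_star

# Hypothetical example with t0 = 10 and p = 1.
t0, p = 10.0, 1.0
psi = lambda m: math.log(m) ** 2 * (1.0 - math.exp(-t0 / m))
phi = lambda m: math.log(m) ** p
grid = [1.0 + 0.01 * k for k in range(1, 100_000)]
psi_star, mu_star = envelope_slope(psi, phi, grid)

def envelope(z):
    """Linear part of the envelope, valid for 0 < z <= phi(mu_star)."""
    if not 0.0 < z <= phi(mu_star):
        raise ValueError("z outside the range covered by the linear part")
    return psi_star * z
```

The maximizing prior puts mass $z/\phi(\mu^*)$ at the maximizing point and the rest at 1, so the constraint is met with equality and the objective equals the slope times z.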

3.3. Maximizing bias and variance. To apply Lemma 3.1 to the bias
proxy, set $\psi(\mu) = \psi_{t_0}(\mu) = b(t_0, \mu) = \log^2(\mu)(1 - e^{-t_0/\mu})$, $\phi(\mu) = \log^p(\mu)$ and
$\Psi(z)$, as in (3.10). Then the worst bias $\sup_{\mathcal{F}_p(\eta)} \int b(t_0, \mu)\, dF = \Psi(\eta^p)$. Direct
calculation shows that for large $t_0$,

$\mu^* = \arg\max_{\mu > 1} [\psi(\mu)/\phi(\mu)] \sim \frac{t_0}{\log\log t_0 - \log(2-p)}$

and

$\Psi^* = \frac{\psi(\mu^*)}{\log^p(\mu^*)} \sim \log^{2-p} t_0 \sim \log^{2-p}\log\left(\frac{1}{\eta}\right).$

It is obvious that for sufficiently small $\eta$, $\eta^p < \phi(\mu^*)$; thus, by Lemma 3.1,
$\Psi(\eta^p) = \Psi^* \cdot \eta^p$ and relation (3.8) follows directly.

Now consider the variance proxy. Let $\psi(\mu) = \psi_{t_0}(\mu) = v(t_0, \mu) - v(t_0, 1)$,
$\phi(\mu) = \log^p(\mu)$ and again with $\Psi(z)$ as in (3.10), the maximal variance
proxy $\sup_{\mathcal{F}_p(\eta)} \int v(t_0, \mu)\, dF = \Psi(\eta^p) + v(t_0, 1)$. Notice here that $v(t_0, 1) =
o(\eta^p \log^{2-p}(\log\frac{1}{\eta}))$, so to show relation (3.9), we need only demonstrate that

(3.12) $\Psi(\eta^p) = O(\eta^p).$

Fig. 2. Generalized convex envelope $\Psi(z)$ for the case $\lim_{\mu \to 1+}[\psi(\mu)/\phi(\mu)] = \infty$
in the $\phi$-$\psi$ plane. The thinner curve is $\{(\phi(\mu), \psi(\mu)) : \mu \ge 1\}$. When $1 < \mu \le \mu_*$,
$\{(\phi(\mu), \Psi(\phi(\mu))) : 1 < \mu \le \mu_*\}$ traces out the same curve as that of $\{(\phi(\mu), \psi(\mu)) : 1 < \mu \le \mu_*\}$
and when $\mu_* < \mu \le \mu^*$, $\Psi(z)$ is a linear function of $z = \phi(\mu)$ which is illustrated by the
line segment. The slope of the line segment equals that of the tangent at $\mu_*$ of the curve
$\{(\phi(\mu), \psi(\mu)) : \mu \ge 1\}$. The case $z > \phi(\mu^*)$ is not discussed.



Direct calculations show that

(3.13) $\lim_{\mu \to 1+} \frac{\psi(\mu)}{\phi(\mu)} = \begin{cases} 0, & 0 < p < 1, \\ \infty, & 1 < p < 2, \end{cases}$

so we will calculate $\Psi(z)$ for the cases $0 < p < 1$ and $1 < p < 2$ separately.

When $0 < p < 1$, let $c = \int_0^\infty \log^2(x) e^{-x}\, dx$ and note that for sufficiently
large $t_0$, the condition of Lemma 3.1 is satisfied; moreover, direct calculations
show that

$\mu^* = \arg\max_{\mu > 1}\{\psi(\mu)/\phi(\mu)\} \sim t_0, \quad \Psi^* = \frac{\psi(\mu^*)}{\log^p(\mu^*)} \sim \frac{c}{\log^p(t_0)};$

for sufficiently small $\eta$ we have $\eta^p < \phi(\mu^*)$, so by Lemma 3.1, $\Psi(\eta^p) = \Psi^* \cdot \eta^p$
and (3.12) follows directly.

When $1 < p < 2$, if we let $\bar\mu$ denote the smaller solution of the equation
$\frac{t_0}{\mu}\log(\mu) = (p - 1)$, then for large $t_0$, $\bar\mu \sim 1 + \frac{p-1}{t_0}$; moreover, by
elementary analysis, $[\psi'(\mu)/\phi'(\mu)]$ is strictly decreasing in $(1, \bar\mu]$ and $\psi'(\bar\mu)/\phi'(\bar\mu) <
\Psi^{**}(\bar\mu)$ and the condition of Lemma 3.2 is satisfied. Furthermore, for large
$t_0$,

(3.14) $\Psi^{**}(\mu) \sim \frac{c}{\log^p(t_0)}, \quad \forall\, 1 < \mu \le \bar\mu.$

More elementary analysis shows that

$\mu^* = \arg\max_{\mu > \bar\mu} \frac{\psi(\mu) - \psi(\mu_*)}{\phi(\mu) - \phi(\mu_*)} \sim \arg\max_{\mu > \bar\mu} \frac{\psi(\mu)}{\phi(\mu)} \sim t_0$

and

$\mu_* = \exp\left([c\, t_0 \log^{2+p}(t_0) e^{-t_0}/p]^{1/(p-1)}\right),$

$\phi(\mu_*) = [c\, t_0 \log^{2+p}(t_0) e^{-t_0}/p]^{p/(p-1)}.$

It is now clear that for sufficiently small $\eta > 0$, $\phi(\mu_*) < \eta^p < \phi(\mu^*)$. Thus,
by Lemma 3.2,

(3.15) $\Psi(\eta^p) = \psi(\mu_*) + \Psi^{**}(\mu_*)(\eta^p - \log^p(\mu_*)).$

Taking $\mu = \mu_*$ in (3.14) and (3.15) gives (3.12), since

$\Psi(\eta^p) = \psi(\mu_*) + \Psi^{**}(\mu_*)[\eta^p - \phi(\mu_*)] \sim \frac{c}{\log^p(t_0)}\,\eta^p = o(\eta^p).$

4. The FDR functional. We now come to the central idea in our analysis
of FDR thresholding: to view the FDR threshold as a functional of the
underlying cumulative distribution function. For any fixed $0 < q < 1$, the
FDR functional $T_q(\cdot)$ is defined as

(4.1) $T_q(G) = \inf\left\{ t : \bar G(t) \ge \frac{1}{q}\bar E(t) \right\},$

where G is any cdf.

The relevance of $T_q$ follows from a simple observation. If $G_n$ is the
empirical distribution of $X_1, X_2, \ldots, X_n$, then $T_q(G_n)$ is effectively the same as the
FDR threshold $\hat t^{FDR}(X_1, \ldots, X_n)$. More precisely (see Lemma 6.1 below),
thresholding at $T_q(G_n)$ and at $\hat t^{FDR}(X_1, \ldots, X_n)$ always gives, numerically,
exactly the same estimate $\hat\mu_{q,n}$.
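For a scale mixture with finitely many atoms, the functional can be computed by locating the point where the mixture's survival function first reaches $\bar E(t)/q$. The sketch below is illustrative: the representation of the mixing distribution as `(mu, weight)` pairs and the name `fdr_functional` are assumptions, and bisection is one convenient root finder, not a method prescribed by the paper.

```python
import math

def fdr_functional(atoms, q, iterations=200):
    """T_q(G) for G = E#F, where F puts weight w on mean mu for (mu, w) in atoms."""
    if not any(mu > 1.0 and w > 0.0 for mu, w in atoms):
        raise ValueError("need some mass on mu > 1 (G must differ from Exp(1))")

    def gap(t):
        # Survival function of the mixture minus exp(-t)/q.
        g_bar = sum(w * math.exp(-t / mu) for mu, w in atoms)
        return g_bar - math.exp(-t) / q

    lo, hi = 0.0, 1.0
    while gap(hi) < 0.0:             # gap(0) = 1 - 1/q < 0; expand until it is >= 0
        hi *= 2.0
    for _ in range(iterations):      # bisection on the sign change
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi
```

For a single point mass at $\mu > 1$ the crossing can be solved by hand: $e^{-t/\mu} = e^{-t}/q$ gives $t = \log(1/q)/(1 - 1/\mu)$, which provides a simple check on the numerical routine.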

In this section, we consider several key properties of this functional. 

4.1. Definition, boundedness and continuity. We first observe that $T_q(G)$
is well defined at nontrivial scale mixtures of exponentials.

Lemma 4.1 (Uniqueness). For fixed $0 < q < 1$ and for all $G \in \mathcal{G}$, $G \ne E$,
the equation

(4.2) $\bar G(t) = \frac{1}{q}\bar E(t)$

has a unique solution on $[0, \infty)$ which we denote $T_q(G)$.

Proof. Indeed, with $\mu$ a random variable greater than or equal to 1,
$\bar G(t) = \mathcal{E}[\bar E(t/\mu)]$. Hence, if $\mu$ is not almost surely equal to 1, then for some $\mu_0 > 1$ and some $\varepsilon > 0$,
we have that for all $t > 0$, $\bar G(t) \ge \varepsilon \bar E(t/\mu_0)$. Now, $\bar G(0) < \bar E(0)/q$, while for
sufficiently large t, $\bar E(t)/q < \varepsilon \bar E(t/\mu_0)$. Hence, for some $t = t_0$ on $[0, \infty)$, (4.2)
holds. Now, consider the slope of $\bar G(t)$,

$-\frac{d}{dt}\bar G(t) = \mathcal{E}[\bar E(t/\mu)/\mu] < \mathcal{E}[\bar E(t/\mu)] = \bar G(t).$

Compare this with the slope of $\bar E(t)/q$. We have

$-\frac{d}{dt}\frac{1}{q}\bar E(t) = \frac{1}{q}\bar E(t).$

At $t = t_0$, $\frac{1}{q}\bar E(t) = \bar G(t)$, so

$\frac{d}{dt}\left[\bar G(t) - \frac{1}{q}\bar E(t)\right]\Big|_{t=t_0} > 0.$

In short, at any crossing of $\bar G - \frac{1}{q}\bar E$, the slope is positive. Downcrossings being
impossible, there is only one upcrossing, so the solution of (4.2) is unique. □

The ideas used in the proof immediately lead to two other important 
properties of T q . 

Lemma 4.2 (Quasi-Concavity). The collection of distributions $G \in \mathcal{G}$
satisfying $T_q(G) = t$ is convex. The collection of distributions satisfying $T_q(G) \ge
t$ is convex.

Proof. The uniqueness lemma shows that the set $T_q(G) = t$ consists
precisely of those cdf's G obeying $\bar G(t) = e^{-t}/q$; this is a linear equality
constraint over the convex set $\mathcal{G}$ and defines a convex subset of $\mathcal{G}$. The set
$T_q(G) \ge t$ consists precisely of those cdf's G obeying $\bar G(t) \le e^{-t}/q$; this is
a linear inequality constraint over the convex set $\mathcal{G}$ and generates a convex
subset. □

We also immediately have the following:

Lemma 4.3 (Stochastic Ordering). We introduce the following notation
for cdf's: $G_0 \preceq G_1$ if $G_0(t) \ge G_1(t)$ for all $t > 0$. Then

$G_0 \preceq G_1 \implies T_q(G_0) \ge T_q(G_1).$

We now turn to boundedness and continuity of T q . Recall that the Kolmogorov- 
Smirnov distance between cdf's G and G' is defined by 

IIG-G'H = sup | -G'(t) | . 
t 
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Viewing the collection of cdf 's as a convex set in a Banach space equipped 
with this metric, the FDR functional T q (-) is, in fact, locally bounded over 
neighborhoods of nontrivial scale mixture of exponentials. 



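Lemma 4.4 (Boundedness). For $G \in \mathcal{G}$, $G \neq E$, 

$-\log\Big(\frac{q}{1-q}\|G - E\|\Big) \le T_q(G) \le \frac{\log(1/q)}{\|G - E\|}.$ 

Proof. We introduce the shorthand notation $\tau = T_q(G)$. The left-hand 
inequality follows from $\bar G(\tau) = \bar E(\tau)/q$, which gives 

$\|G - E\| = \sup_t |\bar G(t) - \bar E(t)| \ge \bar G(\tau) - e^{-\tau} = \frac{1-q}{q}e^{-\tau}.$ 

For the right-hand inequality, again use $\bar G(\tau) = \bar E(\tau)/q$ and convexity of $e^{-x}$ 
to obtain 

$\frac{1}{q}e^{-\tau} = \int e^{-\tau/\mu}\,dF \ge e^{-\tau\int\frac{1}{\mu}\,dF}$, that is, $\tau\int\Big(1 - \frac{1}{\mu}\Big)\,dF \le \log\frac{1}{q}$. 

At the same time, since $\bar E \le \bar G$, we have $\|G - E\| = \sup_{t \ge 0}\int[e^{-t/\mu} - e^{-t}]\,dF$. 

Observe that as a function of $t$, $\int[e^{-t/\mu} - e^{-t}]\,dF$ has a unique maximum 
point $t = t_0$ satisfying $\int\frac{1}{\mu}e^{-t_0/\mu}\,dF = e^{-t_0}$, so 

$\|G - E\| = \int[e^{-t_0/\mu} - e^{-t_0}]\,dF = \int\Big(1 - \frac{1}{\mu}\Big)e^{-t_0/\mu}\,dF \le \int\Big(1 - \frac{1}{\mu}\Big)\,dF$ 

and we have $\tau \le \log(1/q)/\|G - E\|$. □ 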
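
The two bounds of Lemma 4.4 can be checked numerically on two-point mixtures, where both $T_q(G)$ and $\|G - E\|$ have closed forms. The sketch below is only an illustration under that assumption; the helper names are hypothetical.

```python
import math

def fdr_threshold_two_point(eps, mu, q):
    """T_q(G) for G = (1 - eps) Exp(1) + eps Exp(mean mu), mu > 1."""
    return mu / (mu - 1.0) * math.log(1.0 + (1.0 / q - 1.0) / eps)

def ks_distance_two_point(eps, mu):
    """sup_t |G(t) - E(t)| = eps * max_t (e^{-t/mu} - e^{-t})."""
    # The maximum is attained where (1/mu) e^{-t/mu} = e^{-t}.
    t_max = mu * math.log(mu) / (mu - 1.0)
    return eps * (math.exp(-t_max / mu) - math.exp(-t_max))

def lemma_4_4_bounds(eps, mu, q):
    """Lower and upper bounds on T_q(G) in terms of the Kolmogorov-Smirnov distance."""
    d = ks_distance_two_point(eps, mu)
    lower = -math.log(q / (1.0 - q) * d)
    upper = math.log(1.0 / q) / d
    return lower, upper

if __name__ == "__main__":
    for eps, mu, q in [(0.2, 5.0, 0.25), (0.01, 2.0, 0.1), (0.9, 100.0, 0.9)]:
        lower, upper = lemma_4_4_bounds(eps, mu, q)
        print(lower, fdr_threshold_two_point(eps, mu, q), upper)
```

The upper bound becomes nearly tight when most of the mass sits at a very large $\mu$, which is the regime where Jensen's inequality in the proof loses little.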



In fact, the FDR functional is even locally Lipschitz away from $G = E$. 
Note that the image of the mapping $T_q : \mathcal{G} \mapsto \mathbb{R}$ is the interval $(\log(1/q), \infty)$. 

Lemma 4.5 (Modulus of Continuity). Define 

$\omega^*(\epsilon; t_0) = \sup\{|T_q(G') - t_0| : T_q(G) = t_0, \|G - G'\| \le \epsilon, G \in \mathcal{G}\}.$ 

Then for each fixed $t_0 > \log(1/q)$, 

(4.3) $\omega^*(\epsilon; t_0) \le \frac{q}{\log(1/q)}\,t_0 e^{t_0}\,\epsilon \cdot (1 + o(1))$ as $\epsilon \to 0$. 

Crucially, the estimate (4.3) is uniform over $\{G \in \mathcal{G}, T_q(G) \le t_0\}$ for fixed 
$t_0 > 0$. The proof even shows that 

(4.4) $\omega^*(\epsilon; t_0) \le C \cdot \epsilon$ for $0 < \epsilon < \epsilon_{t_0}$, 

where $C = C_{t_0,q} < \infty$ if $t_0 < \infty$. This implies the local Lipschitz property. 



Proof. Consider the optimization problem of finding the cdf $G^* \in \mathcal{G}$ 
which satisfies $T_q(G^*) = t_0$ and, subject to that constraint, is as 'steep' as 
possible at $t_0$, that is, 

(4.5) $\frac{\partial}{\partial t}\bar G^*(t)\Big|_{t=t_0} = \min\Big\{\frac{\partial}{\partial t}\bar G(t)\Big|_{t=t_0} : \bar G(t_0) = \frac{1}{q}\bar E(t_0), G \in \mathcal{G}\Big\}.$ 

Letting $\phi(\mu) = e^{-t_0/\mu}$ and $\psi(\mu) = (t_0/\mu)e^{-t_0/\mu}$, Problem (4.5) can be 
viewed as maximizing the linear functional $\int\psi(\mu)\,dF(\mu)$ with the constraint 
$\int\phi(\mu)\,dF(\mu) = \frac{1}{q}e^{-t_0}$. Observe that $\psi'(\mu)/\phi'(\mu)$ strictly decreases in $\mu$ over 
$(1, \infty)$, so in the $\phi$-$\psi$ plane, the curve $(\phi(\mu), \psi(\mu))$ is strictly concave and 
by arguments used in the proof of Lemma 3.2, the constrained maximum of 
$\int\psi(\mu)\,dF(\mu)$ is obtained at the point mass $F$ which satisfies 
$\int\phi(\mu)\,dF(\mu) = \frac{1}{q}e^{-t_0}$. 

It thus follows that the solution to Problem (4.5) is $\bar G^*_{t_0}(t) = e^{-t/\mu^*}$ for 
$\mu^* = 1/(1 + \log(q)/t_0)$. It has the remarkable property that if $T_q(G) = t_0$, 

(4.6) $\bar G(t) \le \bar G^*_{t_0}(t)$, $0 < t < t_0$, $\bar G(t) \ge \bar G^*_{t_0}(t)$, $t > t_0$. 

Indeed, letting 

$h(t) = [\bar G(t)/\bar G^*_{t_0}(t)] - 1 = \int e^{(1/\mu^* - 1/\mu)t}\,dF(\mu) - 1,$ 

direct calculation shows that $h(t)$ is strictly convex as long as $P_F\{\mu = \mu^*\} \neq 1$ 
(otherwise $h \equiv 0$) and (4.6) follows by observing that $h(0) = h(t_0) = 0$. 
For sufficiently small $\epsilon$, define $t_-$ by 

(4.7) $\bar G^*_{t_0}(t_-) + \epsilon = \bar E(t_-)/q$ 

and define $t_+$ to be the smallest solution to the equation 

(4.8) $\bar G^*_{t_0}(t) - \epsilon = \bar E(t)/q$; 

see Figure 3. Now, if $\|G' - G\| \le \epsilon$, then by (4.6) and (4.8), 

$\bar G'(t_+) \ge \bar G(t_+) - \epsilon \ge \bar G^*_{t_0}(t_+) - \epsilon = \bar E(t_+)/q,$ 

hence, $T_q(G') \le t_+$. Similarly, by (4.6) and (4.7), 

(4.9) $\bar G'(t_-) \le \bar G(t_-) + \epsilon \le \bar G^*_{t_0}(t_-) + \epsilon = \bar E(t_-)/q.$ 

Observe that the function $(\bar G^*_{t_0}(t) - \bar E(t)/q)$ is strictly increasing in the 
interval $[0, t_0]$, so (4.9) can be strengthened into 

$\bar G'(t) \le \bar G(t) + \epsilon \le \bar G^*_{t_0}(t) + \epsilon \le \bar E(t)/q$, $0 < t < t_-$, 

hence, $T_q(G') \ge t_-$. It follows that 

(4.10) $\omega^*(\epsilon; t_0) \le \max\{t_0 - t_-(\epsilon), t_+(\epsilon) - t_0\}.$ 

Fig. 3. The dashed curve is $(1/q)\bar E(t)$ with $q = 1/2$ and the solid curve is $\bar G^*_{t_0}(t)$. In the 
plot, $t_-$ is the solution of $\bar G^*_{t_0}(t) + \epsilon = (1/q)\bar E(t)$ and $t_+$ is the smallest solution to the 
equation $\bar G^*_{t_0}(t) - \epsilon = (1/q)\bar E(t)$. For any other $G$ with $T_q(G) = t_0$, $\bar G(t)$ is bounded above 
by $\bar G^*_{t_0}(t)$ when $0 < t < t_0$ and is bounded below by $\bar G^*_{t_0}(t)$ when $t > t_0$; moreover, for any 
$G'$ with $\|G' - G\| \le \epsilon$, $t_- \le T_q(G') \le t_+$. 

Finally, setting $w = t_+ - t_0$, (4.8) can be rewritten as $e^{-w/\mu^*} - e^{-w} = \epsilon q e^{t_0}$. 
Letting $w(\delta)$ denote the smaller of the two solutions to $e^{-w/\mu^*} - e^{-w} = \delta$, 
elementary analysis shows that for small $\delta > 0$, $w(\delta) \sim \delta/(1 - 1/\mu^*) = 
\delta t_0/\log(1/q)$, so as $\epsilon \to 0$, $t_+ - t_0 \sim (q/\log(1/q)) \cdot t_0 e^{t_0}\epsilon$ and, similarly, 
$t_0 - t_-(\epsilon) \sim (q/\log(1/q)) \cdot t_0 e^{t_0}\epsilon$. Inserting these into (4.10) gives the lemma. □ 
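
The bracket $[t_-, t_+]$ around $t_0$ can be computed directly for the extremal distribution and compared with the first-order width $(q/\log(1/q))\,t_0 e^{t_0}\epsilon$. The following sketch is a hedged illustration: the helper names and the bisection bracket (from $t_0$ up to the peak of $\bar G^*_{t_0} - \bar E/q$) are choices made here, not part of the paper.

```python
import math

def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f(lo) < 0 < f(hi) and a single sign change."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def extremal_bracket(t0, q, eps):
    """Return (t_minus, t_plus) solving (4.7) and (4.8) for the extremal G*_{t0}."""
    if t0 <= math.log(1.0 / q):
        raise ValueError("need t0 > log(1/q)")
    mu_star = 1.0 / (1.0 + math.log(q) / t0)

    def gap(t):
        # Survivor function of G*_{t0} minus the FDR boundary; zero at t0.
        return math.exp(-t / mu_star) - math.exp(-t) / q

    # gap increases on [0, t_peak] and decreases afterwards.
    t_peak = t0 * (1.0 + math.log(mu_star) / math.log(1.0 / q))
    if gap(t_peak) <= eps:
        raise ValueError("eps too large for (4.8) to have a solution")
    t_minus = bisect(lambda t: gap(t) + eps, 0.0, t0)       # equation (4.7)
    t_plus = bisect(lambda t: gap(t) - eps, t0, t_peak)     # smallest root of (4.8)
    return t_minus, t_plus

def first_order_width(t0, q, eps):
    """Leading term of t_plus - t0 and t0 - t_minus as eps -> 0."""
    return q / math.log(1.0 / q) * t0 * math.exp(t0) * eps

if __name__ == "__main__":
    t0, q, eps = 3.0, 0.5, 1e-4
    print(extremal_bracket(t0, q, eps), first_order_width(t0, q, eps))
```

For small $\epsilon$ the two one-sided widths agree with the first-order term up to a relative error of order $\epsilon$, which is the content of (4.3).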

4.2. Behavior under the Bayesian model. The continuity of $T_q$ established 
in Lemma 4.5 and the role of minimax Bayes risk in solving for the 
minimax risk in Sections 2 and 3 combine to suggest a fruitful change of 
viewpoint. Instead of viewing the $X_i \sim \mathrm{Exp}(\mu_i)$ with fixed constants $\mu_i$, 
$i = 1, \ldots, n$, we view the $\mu_i$ as themselves sampled i.i.d. from a distribution 
$F$, so the $X_i$ are sampled i.i.d. from a mixture of exponentials $G$ with mixing 
distribution $F$. Starting now and continuing through Sections 5 and 6, we adopt 
this viewpoint exclusively. Moreover, for our sparsity constraint, instead of assuming 
that $\frac{1}{n}\sum_{i=1}^n \log^p(\mu_i) \le \eta^p$, we assume that this happens in expectation, so 
that $F$ obeys $\mathcal{E}_F \log^p(\mu_1) \le \eta^p$. We call this viewpoint the Bayesian model 
because now the estimands are random. Although it seems a digression from 
our original purposes, it is interesting in its own right and will be connected 
back to the original model in Section 7. 

The motivation for this model is, of course, the ease of analysis. We im- 
mediately obtain the asymptotic consistency of FDR thresholding as given 
in the following: 

Corollary 4.1. For $G \in \mathcal{G}$ and $G \neq E$, the empirical FDR threshold 
$T_q(G_n)$ converges to $T_q(G)$, that is, 

$\lim_{n \to \infty} T_q(G_n) = T_q(G)$, a.s. 

In a natural sense, the FDR functional $T_q(G)$ can be considered as the ideal 
FDR threshold: the threshold that FDR is "trying" to estimate and use. 

Proof. The 'Fundamental Theorem of Statistics' (for example, [16], 
page 1) tells us that if $G_n$ is the empirical cdf of $X_1, X_2, \ldots, X_n$ i.i.d. $G$, 
then 

(4.11) $\|G_n - G\| \to 0$, a.s. 

Simply combining this with continuity of $T_q(G)$ at $G \neq E$ gives the proof. 
□ 

Of course, we can sharpen our conclusions to rates. Under i.i.d. sampling 
$X_i \sim G$, $\|G_n - G\| = O_P(n^{-1/2})$. Matching this, we have a root-n rate of 
convergence for the FDR functional. 

Corollary 4.2. If $G \in \mathcal{G}$ and $G \neq E$, then 

$|T_q(G_n) - T_q(G)| = O_P(n^{-1/2}),$ 

where the $O_P(\cdot)$ is locally uniform in $G$. 

Proof. Indeed, 

$|T_q(G_n) - T_q(G)| \le \omega^*(\|G_n - G\|; T_q(G)) = \omega^*(O_P(n^{-1/2}); T_q(G)).$ 

By (4.4), for small $\epsilon > 0$, $\omega^*(\epsilon; T_q(G)) \le C_G\,\epsilon$, where $C_G$ is locally bounded 
when $G \neq E$. Therefore, this last term is locally uniformly $O_P(n^{-1/2})$ at 
each $G \in \mathcal{G}$ where $G \neq E$. □ 
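
Corollaries 4.1 and 4.2 can be illustrated by simulation. The sketch below applies the FDR functional to an empirical distribution, assuming the empirical survivor function is $\bar G_n(t) = n^{-1}\#\{i : X_i > t\}$, and compares the result with the closed-form $T_q(G)$ of a two-point mixture. The function names, the seed and the parameter values are illustrative choices, not taken from the paper.

```python
import math
import random

def empirical_fdr_threshold(xs, q):
    """inf{t >= 0 : (1/n) #{x_i > t} >= exp(-t) / q}; math.inf if no such t exists."""
    n = len(xs)
    lo = 0.0
    for j, hi in enumerate(sorted(xs)):
        # On [lo, hi) exactly k = n - j observations exceed t, so the
        # condition k/n >= exp(-t)/q reads t >= log(n / (q k)).
        k = n - j
        candidate = max(lo, math.log(n / (q * k)))
        if candidate < hi:
            return candidate
        lo = hi
    return math.inf

def sample_two_point(n, eps, mu, rng):
    """Draw n observations from (1 - eps) Exp(1) + eps Exp(mean mu)."""
    return [rng.expovariate(1.0) * (mu if rng.random() < eps else 1.0)
            for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    eps, mu, q = 0.2, 5.0, 0.25
    t_true = mu / (mu - 1.0) * math.log(1.0 + (1.0 / q - 1.0) / eps)
    for n in (1000, 10000, 100000):
        t_n = empirical_fdr_threshold(sample_two_point(n, eps, mu, rng), q)
        print(n, t_n, t_true)
```

The scan visits the intervals between order statistics from left to right, so the first interval on which the step function reaches the boundary gives the infimum.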

We can, of course, go further. By Massart's work on the DKW constant 
[9, 15], we have 

(4.12) $P\{\|G_n - G\| \ge s/\sqrt{n}\} \le 2e^{-2s^2}$, $\forall s \ge 0$, 

which combines with estimates of $\omega^*$ to control probabilities of deviations 
$T_q(G_n) - T_q(G)$. 

5. Ideal FDR thresholding. Continuing now in the Bayesian model just 
defined, we define the ideal FDR thresholding pseudo-estimate $\tilde\mu_{q,n}$, with 
coordinates $(\tilde\mu_i)$ given by 

(5.1) $\tilde\mu_i = \begin{cases} X_i, & X_i \ge T_q(G), \\ 1, & \text{otherwise.} \end{cases}$ 

In words, we are thresholding at the large-sample limit of the FDR procedure. 
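
A minimal sketch of the thresholding rule (5.1) and of the per-coordinate squared error on the log scale follows. The threshold is passed in as a number, standing in for the oracle value $T_q(G)$; the function names are hypothetical.

```python
import math

def threshold_estimate(xs, t):
    """Rule (5.1): keep x when x >= t, otherwise report the null mean 1."""
    return [x if x >= t else 1.0 for x in xs]

def per_coordinate_log_loss(estimates, mus):
    """Average of (log(estimate_i) - log(mu_i))^2 over the coordinates."""
    total = sum((math.log(m_hat) - math.log(m)) ** 2
                for m_hat, m in zip(estimates, mus))
    return total / len(mus)

if __name__ == "__main__":
    xs = [0.5, 4.0, 2.0]
    est = threshold_estimate(xs, 3.0)
    print(est, per_coordinate_log_loss(est, [1.0, 2.0, 1.0]))
```

Coordinates below the threshold contribute $\log^2(\mu_i)$ to the loss and coordinates above it contribute $\log^2(X_i/\mu_i)$, which is the split into bias and variance proxies used later.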

Note that $T_q(G)$ depends on the underlying cdf $G$, which is actually unknown 
in any realistic situation. $\tilde\mu_{q,n}$ is not a true estimator; it could only 
be applied in a setting where we had side information supplied by an oracle 
which told us $T_q(G)$. We view $\tilde\mu_{q,n}$ as an ideal procedure and the risk for 
$\tilde\mu_{q,n}$ as an ideal risk: the risk we would achieve if we could use the threshold 
that FDR is 'trying' to 'estimate.' Despite the gap between 'true' and 
'ideal,' $\tilde\mu_{q,n}$ plays an important role in studying the true risk for (true) FDR 
thresholding. In fact, we will eventually show that, asymptotically, there is 
only a negligible difference between the ideal risk for $\tilde\mu_{q,n}$ and the (true) risk 
for the FDR thresholding estimator $\hat\mu_{q,n}$. Let $\mathcal{R}_n(T_q, G)$ denote the ideal risk 
for $\tilde\mu_{q,n}$ in the Bayesian model, 

$\mathcal{R}_n(T_q, G) = \frac{1}{n}\mathcal{E}\Big[\sum_{i=1}^n (\log(\tilde\mu_{q,n})_i - \log\mu_i)^2\Big].$ 

Arguing much as in Sections 2 and 3 above, in the Bayesian model, we also 
have the following identity with univariate thresholding risk: 

(5.2) $\mathcal{R}_n(T_q, G) = \rho_T(T_q(G), F).$ 

Since this ideal risk depends only on a univariate random variable $X_1 \sim G$ 
and $T_q(G)$ is nonstochastic, its analysis is relatively straightforward. Also, 
we can now drop the subscript $n$ from $\mathcal{R}_n$. 

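Theorem 5.1. Fix $0 < q < 1$ and $0 < p \le 2$. 
1. Worst-case ideal risk. We have 

(5.3) $\lim_{\eta \to 0}\Big[\frac{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(T_q, G)}{\eta^p\log^{2-p}\log\frac{1}{\eta}}\Big] = \begin{cases} 1, & 0 < q \le \frac{1}{2}, \\ \frac{q}{1-q}, & \frac{1}{2} < q < 1. \end{cases}$ 

2. Least favorable scale mixture. Fix $0 \le s \le 1$. Set 

$\mu_b^* = \mu_b^*(\eta) = \log\Big(\frac{1}{\eta}\Big)\Big/\log\log\log\Big(\frac{1}{\eta}\Big)$, $\mu_v^* = \mu_v^*(\eta) = \log\frac{1}{\eta}\cdot\log\log\frac{1}{\eta}$ 

and 

$G_{\epsilon,\mu} = (1 - \epsilon)E(\cdot) + \epsilon E(\cdot/\mu)$, $\epsilon \cdot \log^p(\mu) = \eta^p$. 

Define 

$\tilde\mu = \tilde\mu(\eta; q, s) = \begin{cases} \mu_b^*(\eta), & 0 < q < \frac{1}{2}, \\ (1 - s)\cdot\mu_b^*(\eta) + s\cdot\mu_v^*(\eta), & q = \frac{1}{2},\ 0 \le s \le 1, \\ \mu_v^*(\eta), & \frac{1}{2} < q < 1. \end{cases}$ 

Then $G_{\epsilon,\tilde\mu}$ is asymptotically least favorable for $T_q$, that is, 

$\lim_{\eta \to 0}\Big[\frac{\mathcal{R}(T_q, G_{\epsilon,\tilde\mu})}{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(T_q, G)}\Big] = 1.$ 

By Theorems 2.1-2.3, the denominator on the left-hand side of (5.3) is 
asymptotically equivalent to the minimax risk in the original model of Section 1. 
In words, the worst-case ideal risk for the i.i.d. sampling model is 
asymptotically equivalent to the minimax risk (1.4) as $\eta \to 0$. This, of course, 
is no accident; it is a key step towards Theorem 1.3. 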

5.1. Proof of Theorem 5.1. We now describe, in a series of lemmas, the 
ideas for proving Theorem 5.1. In later subsections, we prove the individual 
lemmas. 

Since the ideal risk $\mathcal{R}(T_q, G)$ is, by (5.2), reducible to the univariate 
thresholding Bayes risk which we studied in Section 3, we know to split 
the ideal risk $\mathcal{R}(T_q, G)$ into two terms, the bias proxy $B^2(T_q, G)$ and the 
variance proxy $V(T_q, G)$. 

Note that as $G$ tends to $E$, Lemma 4.4 implies that $T_q(G) \to \infty$. Since 
$v(T_q(G))$ decreases rapidly, the key to majorizing the variance is to keep 
$T_q(G)$ small, motivating study of 

(5.5) $T^* = T^*(\eta; p) = \inf_{G \in \mathcal{G}_p(\eta)} T_q(G).$ 

Consider $V(T_q, G)$. Asymptotically as $\eta \to 0$, every eligible $F \in \mathcal{F}_p(\eta)$ puts 
almost all mass in the vicinity of 1, so 




Lemma 5.1. As $\eta \to 0$, 




Lemma 5.2. As $\eta \to 0$, 

$T^* = T^*(\eta; p) = p\Big(\log\frac{1}{\eta} + \log\log\log\frac{1}{\eta}\Big) + \log\Big(\frac{1-q}{q}\Big) + o(1).$ 

The proof is given in Section 5.2. As a direct result, we get 

$\log^2(T^*)\,e^{-T^*} \sim \frac{q}{1-q}\,\eta^p\log^{2-p}\log\frac{1}{\eta}$; 

moreover, when $T_q(G)$ exceeds $T^*$, the variance proxy $v(T^*)$ drops and we 
obtain the following: 



Lemma 5.3. ^4s?7^0 ; 



sup V(T q ,G) 



— - — r^log 2 p log- 
1-q i] 



and 



sup _V(T q ,G) = o(rf log 2 ~ p logi 

G&g P {r,),T q {G)>T;+JT* V V 



We now study the bias proxy. The key observation is as follows: 



(5.6) 



b(t,fi) 



flogV, 



To develop intuition, consider the family of 2-point mixtures 

$\{G_{\epsilon,\mu} = (1 - \epsilon)E(\cdot) + \epsilon E(\cdot/\mu),\ \epsilon\log^p\mu = \eta^p\}.$ 

Now, (5.6) tells us that the maximum of the bias functional over this family 
is obtained by taking $\mu$ as large as possible, while avoiding $T_q(G_{\epsilon,\mu})/\mu \ll 1$; 
moreover, direct calculations show that 

(5.7) $\frac{T_q(G_{\epsilon,\mu})}{\mu} = \frac{\log\big(1 + (\frac{1}{q} - 1)\frac{1}{\eta^p}\log^p(\mu)\big)}{\mu - 1},$ 

so the value of fi causing the worst bias proxy should be close to the solution 
of the following equation: 

log(l+p(i-l) X lo g^)) 

Elaborating on this idea leads to the following result, to be proven in Section 
5.3: 






Lemma 5.4. As $\eta \to 0$, 

$\sup_{G \in \mathcal{G}_p(\eta)} B^2(T_q, G) = \Big(\eta^p\log^{2-p}\log\frac{1}{\eta}\Big)(1 + o(1)).$ 
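
To see how the bias proxy behaves over the two-point family, one can evaluate the closed-form threshold behind (5.7) together with $B^2(T_q, G_{\epsilon,\mu}) = \epsilon\log^2(\mu)\,(1 - e^{-T_q/\mu})$. The sketch below assumes that form of the bias proxy (the mass at $\mu = 1$ contributes nothing because $\log 1 = 0$) and uses hypothetical function names and parameter values.

```python
import math

def two_point_threshold(eta, p, q, mu):
    """T_q(G_{eps,mu}) under the calibration eps * log(mu)**p = eta**p."""
    eps = (eta / math.log(mu)) ** p
    return mu / (mu - 1.0) * math.log(1.0 + (1.0 / q - 1.0) / eps)

def bias_proxy(eta, p, q, mu):
    """eps * log(mu)^2 * P(X < T_q) for X exponential with mean mu."""
    eps = (eta / math.log(mu)) ** p
    t = two_point_threshold(eta, p, q, mu)
    return eps * math.log(mu) ** 2 * (1.0 - math.exp(-t / mu))

if __name__ == "__main__":
    eta, p, q = 1e-3, 1.0, 0.25
    for mu in (2.0, 5.0, 10.0, 30.0, 100.0):
        ratio = two_point_threshold(eta, p, q, mu) / mu
        print(mu, ratio, bias_proxy(eta, p, q, mu))
```

The bias proxy never exceeds $\eta^p\log^{2-p}(\mu)$, and it comes close to that value only while $T_q/\mu$ stays large, which is the trade-off described before (5.7).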

Combine the above analysis for bias and variance proxies, to give 

$1 + o(1) \le \frac{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(T_q, G)}{\eta^p\log^{2-p}\log\frac{1}{\eta}} \le \frac{1}{1-q} + o(1)$ as $\eta \to 0$. 

Compare this to the conclusion of Theorem 5.1; we have obtained the correct 
rate, but not yet the precise constant. To refine our analysis, note that the 
worst bias and the worst variance are obtained at different values $\mu$ within 
the two-point family. Denote the $\mu$'s causing the worst bias and the worst 
variance by $\mu_b^*$ and $\mu_v^*$. Then 

$\mu_b^* \sim \frac{\log\frac{1}{\eta}}{\log\log\log\frac{1}{\eta}}$, $\mu_v^* \sim \log\frac{1}{\eta}\cdot\log\log\frac{1}{\eta}$ as $\eta \to 0$. 
log log i r) rj 

Divide G P (jj) m t° two subsets, 

Gi = {Geg p ( V ),T q (G)>T* + Jr*}, 



&MGg^),t 5 (G)<t; + ^} 

and consider each separately. [Note that G^* G Q\, while G^* G Qi- Here, 
G^* and G^* are mixtures of point masses at 1 and fi living in Gp'°(rj) with 
fi = /j-l and fj,*, respectively]. Over the first subset, the variance is uniformly 
0{rj p ) and we immediately obtain 

supft(T„,G) «supB 2 (r„G) Ps?? p log 2 ~ p logi as rj -»• 0. 

For the second subset, the following lemma is proved in [6], page 22: 

Lemma 5.5. As $\eta \to 0$, 

$\sup_{G \in \mathcal{G}_2}\mathcal{R}(T_q, G) = \begin{cases} \big(\eta^p\log^{2-p}\log\frac{1}{\eta}\big)(1 + o(1)), & 0 < q \le \frac{1}{2}, \\ \frac{q}{1-q}\big(\eta^p\log^{2-p}\log\frac{1}{\eta}\big)(1 + o(1)), & \frac{1}{2} < q < 1. \end{cases}$ 

Theorem 5.1 follows once Lemmas 5.2 and 5.4 are proved. 

5.2. Proof of Lemma 5.2. Consider the upper envelope of the survivor 
function among moment-constrained scale mixtures, 

$\bar G^*_t = \bar G^*_t(\eta, p) = \sup\{\bar G(t), G \in \mathcal{G}_p(\eta)\}.$ 

The quantity of interest is the crossing point where this envelope meets the 
FDR boundary, 

$T^* = \inf\{t : \bar G^*_t \ge \bar E(t)/q\}.$ 

Equivalently, 

(5.8) $T^* = \inf\{t : [(\bar G^*_t/\bar E(t)) - 1] \ge (1 - q)/q\}.$ 

Letting 

$h^*(t; \eta, p) = [(\bar G^*_t/\bar E(t)) - 1],$ 

the key to calculating $T^*$ is to explicitly express $h^*(t)$ as a function of $t$, 
asymptotically, for small $\eta$. 

Calculating $h^*(t)$ again involves optimization of a linear functional over 
a class of moment-constrained cdf's and we can apply the theory in Section 
3.2. Set $\psi = \psi_t(\mu) = [e^{(1 - 1/\mu)t} - 1]$ and $\phi(\mu) = \log^p(\mu)$ and define $\Psi = \Psi_t$ as 
in (3.10) so that $h^*(t; \eta, p) = \Psi_t(\eta^p)$. Note that 

(5.9) $\lim_{\mu \to 1+}\frac{\psi_t(\mu)}{\log^p(\mu)} = \begin{cases} 0, & 0 < p < 1, \\ t, & p = 1, \\ \infty, & 1 < p \le 2, \end{cases}$ 

so we treat the cases $0 < p \le 1$ and $1 < p \le 2$ separately. 

When $0 < p \le 1$, elementary analysis shows that for large $t$, 

$\mu^* = \mathrm{argmax}_{\mu > 1}\Big\{\frac{\psi_t(\mu)}{\log^p(\mu)}\Big\} \sim \frac{t\log(t)}{p}$, $\frac{\psi_t(\mu^*)}{\log^p(\mu^*)} \sim \frac{e^t}{\log^p(t)},$ 

so the condition of Lemma 3.1 is satisfied and 

(5.10) $\Psi_t(\eta^p) \sim \eta^p e^t/\log^p(t).$ 

Inserting (5.10) into (5.8) and solving for $t$ gives the lemma for the case 
$0 < p \le 1$. 

When 1 <p< 2, direct calculations show that the function tj/ (/x) / cf>' (p) 
strictly increases in the interval (l,/2] with log(p) = log(p(i;p)) = (p — l)/i, 
also that [ip' (p,) / ()>' (ji)] < \I/**(/2), so the condition of Lemma 3.2 is satisfied. 
More calculations show first, that, 



fi* = p*(i;p) ~ argmax< 



W>ti Uog p (/x')J plog(t) 
second, that for any 1 < fx < fj, 
¥**(/*) =***(/x;t) 



max < - — — — — — — > ~ max ' 



W>t} I log p (//) - log p (fi) I I log'V) J logP(t) 

and, finally, that 

/l \V(p-i) 
log(/i*) = Iog(M*(t;p)) ~ (^-tlog p (t)e-'J 

since h*(t,rj,p) = ^ft(rf). By Lemma 3.2, 

r e (i-e-")t_ 1; ^P<logP(^), 

(5.11) /i*(*,77,p) = I e {l -^ ]t - 1 + - log(^)), 

I logf(^)<7f <logf(At*); 

moreover, by letting t* = t*(rj) denote the solution of log p (//*(£,p)) = rf , we 
can rewrite (5.11) as 

r (i-e-")t_ 1 t<t * 

(5.12) fc*(t;»J,p) = { ^ ' " ' 

here noting that t* ~ (p — l)plog(^) for small r/. 

Inserting (5.12) into (5.8), it becomes clear that for sufficiently small rj 
and t <t* , /i(i; rj, p) « 0. Thus, T* is obtained by equating 

Lll = e (i-^)* _ i + ***( M ,)(rf - log(^)) ~ rf e7log p (i), 

which gives the lemma for the case 1 < p < 2. □ 



5.3. Proof of Lemma 5.4. 

Lemma 5.6. Let $\psi$ be a measurable function defined on $[1, \infty)$ with $\psi \ge 0$, 
not identically 0, and $\sup_{\mu \ge 1}\{\psi(\mu)/\mu\} < \infty$. Then for $G \in \mathcal{G}$ and 
$0 < \tau < T_q(G)$, we have 

$\int\psi(\mu)[e^{-\tau/\mu} - e^{-T_q(G)/\mu}]\,dF \le (1/q)\sup_{\mu \ge 1}\{\psi(\mu)/\mu\}\cdot\tau e^{-\tau}/(1 - e^{-\tau}).$ 

Letting $\tau \to 0$ and combining Lemma 5.6 with Fatou's Lemma, we have 

(5.13) $\int\psi(\mu)[1 - e^{-T_q(G)/\mu}]\,dF \le (1/q)\sup_{\mu \ge 1}\{\psi(\mu)/\mu\}.$ 

Proof. Let $k_0 = k_0(\tau; G) = \lfloor T_q(G)/\tau\rfloor$. Since $T_q(G) > \tau$, $k_0 \ge 1$. Moreover, 

(5.14) $\int\psi(\mu)[e^{-\tau/\mu} - e^{-T_q(G)/\mu}]\,dF \le \int\psi(\mu)[e^{-\tau/\mu} - e^{-(k_0+1)\tau/\mu}]\,dF$ 

(5.15) $= \int\psi(\mu)(1 - e^{-\tau/\mu})\Big[\sum_{j=1}^{k_0} e^{-j\tau/\mu}\Big]\,dF.$ 

We introduce the shorthand notation $c = \sup_{\mu \ge 1}\{\psi(\mu)/\mu\}$ and recall that 
$1 - e^{-x/\mu} \le x/\mu$ for all $x > 0$, so for $1 \le j \le k_0$, 

(5.16) $\int\psi(\mu)(1 - e^{-\tau/\mu})e^{-j\tau/\mu}\,dF \le \tau\int(\psi(\mu)/\mu)e^{-j\tau/\mu}\,dF \le \tau\cdot c\cdot\int e^{-j\tau/\mu}\,dF.$ 

By definition of $k_0$ and the FDR functional, 

(5.17) $\int e^{-j\tau/\mu}\,dF = \bar G(j\tau) \le (1/q)e^{-j\tau}$, $1 \le j \le k_0$. 

Combining (5.14)-(5.17) gives 

(5.18) $\int\psi(\mu)[e^{-\tau/\mu} - e^{-T_q(G)/\mu}]\,dF \le (c/q)\cdot\tau\cdot\sum_{j=1}^{k_0} e^{-j\tau} \le (c/q)\cdot\tau\cdot e^{-\tau}/(1 - e^{-\tau}).$ □ 



We now prove Lemma 5.4. As in Section 3, let 

$t_0 = t_0(p, \eta) = p\log(1/\eta) + p\log\log(1/\eta) + \sqrt{\log\log(1/\eta)}.$ 

By the monotonicity of $b(t, \mu)$ and (3.8), for sufficiently small $\eta > 0$, 

(5.19) $\sup_{G \in \mathcal{G}_p(\eta),\,T_q(G) \le t_0} B^2(T_q, G) \le \sup_{F \in \mathcal{F}_p(\eta)}\int b(t_0, \mu)\,dF = \eta^p\log^{2-p}\log(1/\eta)(1 + o(1)).$ 

Moreover, for any $G$ with $T_q(G) > t_0$, letting $\psi(\cdot) = \log^2(\cdot)$ and $\tau = t_0$ in 
Lemma 5.6, we have 

$0 \le B^2(T_q, G) - \int b(t_0, \mu)\,dF = \int\log^2(\mu)[e^{-t_0/\mu} - e^{-T_q(G)/\mu}]\,dF \le (c/q)\,t_0 e^{-t_0}/(1 - e^{-t_0}),$ 

where $c = \max_{\mu \ge 1}\{\log^2(\mu)/\mu\}$, so it is clear that 

(5.20) $\sup_{G \in \mathcal{G}_p(\eta),\,T_q(G) > t_0} B^2(T_q, G) \le \sup_{F \in \mathcal{F}_p(\eta)}\int b(t_0, \mu)\,dF + O(t_0 e^{-t_0}).$ 

Lemma 5.4 follows directly from (5.19)-(5.20) and $t_0 e^{-t_0} = o(\eta^p\log^{2-p}\log(\frac{1}{\eta}))$. □ 

6. Asymptotic risk behavior for FDR thresholding. We now turn to $\hat\mu_{q,n}$, 
the true FDR thresholding estimator. For technical reasons, we define a 
threshold $\hat T_{q,n}$ slightly differently than $\hat t^{FDR}$. This difference does not affect 
the estimate. Thus, we will have $\hat\mu_{q,n} = (\hat\mu_i)$ with 

$\hat\mu_i = \begin{cases} X_i, & X_i \ge \hat T_{q,n}, \\ 1, & X_i < \hat T_{q,n}. \end{cases}$ 

Our strategy is to show that the ideal and true FDR behave similarly. 

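Still in the Bayesian model, we let $\mathcal{R}_n(\hat T_{q,n}, G)$ denote the per-coordinate 
average risk for $\hat\mu_{q,n}$, that is, 

$\mathcal{R}_n(\hat T_{q,n}, G) = \frac{1}{n}\mathcal{E}\Big[\sum_{i=1}^n (\log(\hat\mu_{q,n})_i - \log\mu_i)^2\Big].$ 

Here, again, the expectation is over $(X_i, \mu_i)$ pairs i.i.d. with bivariate 
structure $X_i|\mu_i \sim \mathrm{Exp}(\mu_i)$. 

We will show that as $n \to \infty$, the difference between the true risk $\mathcal{R}_n(\hat T_{q,n}, G)$ 
and the ideal risk $\mathcal{R}(T_q, G)$ is asymptotically negligible. We suppress the 
subscript $n$ on $\mathcal{R}_n$ (this is an abuse of notation). 

Theorem 6.1. 

$\lim_{n \to \infty}\Big[\sup_{G \in \mathcal{G}}|\mathcal{R}(\hat T_{q,n}, G) - \mathcal{R}(T_q, G)|\Big] = 0.$ 

As a result, 

$\lim_{n \to \infty}\Big[\sup_{G \in \mathcal{G}_p(\eta)}|\mathcal{R}(\hat T_{q,n}, G) - \mathcal{R}(T_q, G)|\Big] = 0.$ 

Combining Theorems 6.1 and 5.1, we have 

$\lim_{\eta \to 0}\Big[\lim_{n \to \infty}\frac{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(\hat T_{q,n}, G)}{\eta^p\log^{2-p}\log\frac{1}{\eta}}\Big] = \begin{cases} 1, & 0 < q \le \frac{1}{2}, \\ \frac{q}{1-q}, & \frac{1}{2} < q < 1. \end{cases}$ 

Hence, $\hat T_{q,n}$ asymptotically achieves the $n$-variate minimax Bayes risk when 
$n \to \infty$ followed by $\eta \to 0$. 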



6.1. Proof of Theorem 6.1. We begin by defining $\hat T_{q,n}$. In applying the 
FDR functional to the empirical distribution, it is always possible that 

(6.1) $\bar G_n(t) < \frac{1}{q}\bar E(t)$, for all $t > 0$, 

in which case $T_q(G_n) = \hat t^{FDR} = +\infty$. Letting $W_n$ denote the event (6.1), 
define 

(6.2) $\hat T_{q,n} = \begin{cases} T_q(G_n), & \text{over } W_n^c, \\ \log(n/q), & \text{over } W_n. \end{cases}$ 

The following lemma, which was proven in [6, 11], shows that this definition 
of threshold gives the same estimator as $T_q(G_n)$, while obeying a bound 
which is convenient for analysis: 

Lemma 6.1. Suppose $X_i \sim G$, $G \in \mathcal{G}$, $G \neq E$ and $\hat T_{q,n}$ is defined as in 
(6.2). Then 

1. The FDR estimator is equivalently realized by thresholding at $\hat T_{q,n}$. 

2. $\hat T_{q,n} \le \log(n/q)$. 
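
The event $W_n$ in (6.1) can be checked from the order statistics alone. The sketch below assumes the empirical survivor function $\bar G_n(t) = n^{-1}\#\{i : X_i > t\}$ and takes the cap in (6.2) to be $\log(n/q)$, as the bound in Lemma 6.1 suggests; the function names and the `empirical_threshold` argument are hypothetical.

```python
import math

def in_event_w(xs, q):
    """True when the empirical survivor function stays below exp(-t)/q for all t > 0."""
    n = len(xs)
    desc = sorted(xs, reverse=True)
    # Just below the k-th largest observation the empirical survivor function
    # equals k/n, while exp(-t)/q decreases towards exp(-x_(k))/q.
    return all(k / n <= math.exp(-x) / q for k, x in enumerate(desc, start=1))

def capped_threshold(xs, q, empirical_threshold):
    """Rule (6.2): log(n/q) over W_n, the supplied T_q(G_n) otherwise."""
    n = len(xs)
    return math.log(n / q) if in_event_w(xs, q) else empirical_threshold

if __name__ == "__main__":
    xs = [0.1, 0.2, 0.3]
    print(in_event_w(xs, 0.5), capped_threshold(xs, 0.5, math.inf))
```

Over $W_n$ the largest observation is at most $\log(n/q)$, so apart from ties with that value, thresholding at the cap keeps no coordinate, which is the same estimate as thresholding at $+\infty$.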

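Next, we study the risk for $\hat T_{q,n}$. We have 

$\mathcal{R}(\hat T_{q,n}, G) = \frac{1}{n}\mathcal{E}_F\mathcal{E}_\mu\sum_{i=1}^n\big[\log^2(\mu_i)1_{\{X_i < \hat T_{q,n}\}} + \log^2(X_i/\mu_i)1_{\{X_i \ge \hat T_{q,n}\}}\big]$ 

$= \mathcal{E}_F\mathcal{E}_\mu\big[\log^2(\mu_1)1_{\{X_1 < \hat T_{q,n}\}} + \log^2(X_1/\mu_1)1_{\{X_1 \ge \hat T_{q,n}\}}\big]$ 

and $\mathcal{R}(\hat T_{q,n}, G)$ naturally splits into a 'bias' proxy and a 'variance' proxy, 
as follows: 

$\hat B^2(\hat T_{q,n}, G) = \mathcal{E}_F\mathcal{E}_\mu[\log^2(\mu_1)1_{\{X_1 < \hat T_{q,n}\}}],$ 
$\hat V(\hat T_{q,n}, G) = \mathcal{E}_F\mathcal{E}_\mu[\log^2(X_1/\mu_1)1_{\{X_1 \ge \hat T_{q,n}\}}].$ 

The comparable notions in the ideal risk case were 

$B^2(T_q, G) = \mathcal{E}_F\mathcal{E}_\mu[\log^2(\mu_1)1_{\{X_1 < T_q(G)\}}],$ 
$V(T_q, G) = \mathcal{E}_F\mathcal{E}_\mu[\log^2(X_1/\mu_1)1_{\{X_1 \ge T_q(G)\}}].$ 

Intuitively, we expect that $\hat B^2$ is 'close' to $B^2$ and that $\hat V$ is 'close' to $V$; our 
next task is to validate these expectations. Observe that 

(6.3) $|\hat B^2(\hat T_{q,n}, G) - B^2(T_q, G)| \le \mathcal{E}[\log^2(\mu_1)|1_{\{X_1 < \hat T_{q,n}\}} - 1_{\{X_1 < T_q(G)\}}|],$ 

(6.4) $|\hat V(\hat T_{q,n}, G) - V(T_q, G)| \le \mathcal{E}[\log^2(X_1/\mu_1)|1_{\{X_1 < \hat T_{q,n}\}} - 1_{\{X_1 < T_q(G)\}}|].$ 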

It would not be hard to validate the expectations if $|\hat T_{q,n} - T_q(G)|$ were 
negligible for large $n$, uniformly for $G \in \mathcal{G}$. In Section 4, Lemma 4.5 tells us 
that $\hat T_{q,n} - T_q(G)$ is locally $O_P(n^{-1/2})$ or, more specifically, 

(6.5) $|T_q(G) - T_q(G_n)| \lesssim \frac{q}{\log(1/q)}\,T_q(G)e^{T_q(G)}\|G - G_n\|$, $G \neq E$. 

Unfortunately, for any fixed $n$, $G$ might get arbitrarily close to $E$ and, as a 
result, $T_q(G)$ might get arbitrarily large, so the relationship in (6.5) cannot 
hold uniformly over $G \in \mathcal{G}$. 

A closer look reveals that those $G$'s failing (6.5) would, roughly, satisfy 

$T_q(G)e^{T_q(G)} \ge \sqrt{n}$, or $T_q(G) \ge \log(n)/2$. 

Note that as $n$ increases from 1 to $\infty$, $\{G \in \mathcal{G} : T_q(G) \ge \log(n)/2\}$ defines a 
sequence of subsets, strictly decreasing to $\emptyset$. Motivated by this, we look for 
a sequence of subsets $\mathcal{G}^{(n)}$ of $\mathcal{G}$ obeying 

(a) $\mathcal{G}^{(1)} \subset \mathcal{G}^{(2)} \subset \cdots \subset \mathcal{G}^{(n)} \subset \cdots$ and $\bigcup_{n=1}^\infty \mathcal{G}^{(n)} = \mathcal{G}$; 

(b) $\mathcal{G}^{(n)}$ approaches $\mathcal{G}$ slowly enough such that $\sup_{\mathcal{G}^{(n)}}[\frac{1}{\sqrt{n}}T_q(G)e^{T_q(G)}] = o(1)$; 

(c) for large $n$, $|\mathcal{R}(\hat T_{q,n}, G) - \mathcal{R}(T_q, G)|$ is uniformly negligible over $\mathcal{G} \setminus \mathcal{G}^{(n)}$. 

A convenient choice is 

(6.6) $\mathcal{G}^{(n)} = \{G \in \mathcal{G} : T_q(G) \le \log(n)/8\}$, $n \ge 1$. 

We expect that the difference between $T_q(G_n)$ and $T_q(G)$ is uniformly 
negligible over $\mathcal{G}^{(n)}$, that is, 

$\sup_{\mathcal{G}^{(n)}}|T_q(G) - T_q(G_n)| = o_p(1).$ 

Lemma 6.2. Let $A_n$ denote the event $\{|\hat T_{q,n} - T_q(G)| \le n^{-1/4}\}$. Then 
for sufficiently large $n$, 

sup p g {a*} < 3e -mi- q ?/ew/vio S ^n)_ 

Based on Lemma 6.2, one can develop a proof for the following: 

Lemma 6.3. For sufficiently small $0 < \delta < 1$, 

1. $\lim_{n \to \infty}\sup_{G \in \mathcal{G}^{(n)}}|\hat B^2(\hat T_{q,n}, G) - B^2(T_q, G)| = 0$; 

2. $\lim_{n \to \infty}\sup_{G \in \mathcal{G}^{(n)}}|\hat V(\hat T_{q,n}, G) - V(T_q, G)| = 0$. 

As a result, $\lim_{n \to \infty}\sup_{G \in \mathcal{G}^{(n)}}|\mathcal{R}(\hat T_{q,n}, G) - \mathcal{R}(T_q, G)| = 0$. 

We now consider (c). Define 

(6.7) $\mathcal{G}_0^{(n)} = \mathcal{G} \setminus \mathcal{G}^{(n)}$, $n \ge 1$. 

Though it is no longer sensible to require that $|T_q(G_n) - T_q(G)|$ be uniformly 
negligible over $\mathcal{G}_0^{(n)}$, we still hope that $T_q(G_n)$ at least stays at the same 
magnitude as $T_q(G)$, or $T_q(G_n) = O_p(\log(n))$. This turns out to be true and, 
in fact, is an immediate consequence of Massart's inequality (4.12). 

Lemma 6.4. Letting $D_n$ be the event $\{\hat T_{q,n} \ge \log(n)/16\}$, 
sup P G {D c J = 2e~ 2 ^^) 2 /i 2 W / \ 



Combining this with Lemma 6.1, we have, except for an event with negligible 
probability, 

$\log(n)/16 \le \hat T_{q,n} \le \log(n/q).$ 

Since $v(t, \mu)$ is monotone decreasing in $t$, it is now clear that both $\hat V(\hat T_{q,n}, G)$ 
and $V(T_q, G)$ are uniformly negligible over $\mathcal{G}_0^{(n)}$. 

Finally, note that $b(t, \mu)$ is strictly increasing in $t$, so either $\hat B^2(\hat T_{q,n}, G)$ 
or $B^2(T_q, G)$ will not be uniformly negligible over $\mathcal{G}_0^{(n)}$. However, note that 
$b(t, \mu)$ increases very slowly in $t$ for large $t$, so we can expect that 
$|\hat B^2(\hat T_{q,n}, G) - B^2(T_q, G)|$ is uniformly negligible over $\mathcal{G}_0^{(n)}$. 

Lemma 6.6. $\lim_{n \to \infty}[\sup_{G \in \mathcal{G}_0^{(n)}}|\hat B^2(\hat T_{q,n}, G) - B^2(T_q, G)|] = 0$. 

The choice of $\log(n)/8$ is only for convenience; a similar result holds if we 
replace $\log(n)/8$ by $c\log(n)$ for $0 < c < 1/2$. 

Combining the above lemmas yields Theorem 6.1. □ 

The proofs of Lemmas 6.1-6.6 can be found in the full version of this 
paper [6]. 

7. Proof of Theorem 1.3. We now complete the proof of Theorem 1.3. 
The key point is to relate the Bayesian model of Sections 4-6 to the 
frequentist model of Section 1. In the frequentist model, $X_i \sim \mathrm{Exp}(\mu_i)$, $1 \le i \le n$, 
where $\mu = (\mu_1, \mu_2, \ldots, \mu_n)$ is an arbitrary deterministic vector $\mu \in M_{n,p}(\eta)$. 
Recall that $\mathcal{R}_n(\hat T_{q,n}, G)$ denotes the risk of FDR estimation in the Bayesian 
model, while $R_n(\hat\mu_{q,n}, \mu)$ denotes the risk in the frequentist model. Below, 
we will show that 



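(7.1) $\lim_{\eta \to 0}\Big[\lim_{n \to \infty}\frac{\sup_{\mu \in M_{n,p}(\eta)}R_n(\hat\mu_{q,n}, \mu)}{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}_n(\hat T_{q,n}, G)}\Big] = 1.$ 

Recall that by Theorems 1.1, 5.1 and 6.1, we have 

$\lim_{\eta \to 0}\Big[\lim_{n \to \infty}\frac{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}_n(\hat T_{q,n}, G)}{R_n^*(M_{n,p}(\eta))}\Big] = \begin{cases} 1, & 0 < q \le \frac{1}{2}, \\ \frac{q}{1-q}, & \frac{1}{2} < q < 1, \end{cases}$ 

so Theorem 1.3 follows from (7.1). To prove (7.1), let $G_\mu$ denote the mixture 
$G_\mu = \frac{1}{n}\sum_{i=1}^n E(\cdot/\mu_i)$, let $R_n(\tilde\mu_{q,n}, \mu)$ denote the ideal risk for thresholding at 
$T_q(G_\mu)$ under the frequentist model and let $\mathcal{R}(T_q, G)$ again denote the ideal 
risk for thresholding at $T_q(G)$ in the Bayesian model. We have the following 
crucial identity: 

(7.2) $R_n(\tilde\mu_{q,n}, \mu) = \mathcal{R}(T_q, G_\mu)$, $\forall \mu, n$. 

Also, note that the class of $G_\mu$'s arising from some $\mu \in M_{n,p}(\eta)$ is a subset 
of the class of all $G$'s arising in $\mathcal{G}_p(\eta)$, for each $n > 0$. Hence, 

$\sup_{\mu \in M_{n,p}(\eta)}\mathcal{R}(T_q, G_\mu) \le \sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(T_q, G)$, $\forall n$. 

However, note that by Theorem 5.1, appropriately chosen 2-point priors can 
be asymptotically least-favorable for ideal risk in the Bayesian model. By 
choosing $\mu$ which contain entries with only the two underlying values in the 
least favorable prior and with appropriate underlying frequencies, we can 
obtain 

(7.3) $\lim_{\eta \to 0}\Big[\frac{\lim_{n \to \infty}\sup_{\mu \in M_{n,p}(\eta)}\mathcal{R}(T_q, G_\mu)}{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(T_q, G)}\Big] = 1.$ 

Now, relating the Bayesian to the frequentist model via (7.2), we have 

(7.4) $\lim_{\eta \to 0}\Big[\frac{\lim_{n \to \infty}\sup_{\mu \in M_{n,p}(\eta)}R_n(\tilde\mu_{q,n}, \mu)}{\sup_{G \in \mathcal{G}_p(\eta)}\mathcal{R}(T_q, G)}\Big] = 1.$ 

Suppose we can next show that the ideal FDR risk in the frequentist model 
is equivalent to the true risk in the frequentist model, in the same sense as 
was proved in Theorem 6.1, that is, 

(7.5) $\lim_{\eta \to 0}\lim_{n \to \infty}\Big[\frac{\sup_{\mu \in M_{n,p}(\eta)}R_n(\hat\mu_{q,n}, \mu)}{\sup_{\mu \in M_{n,p}(\eta)}R_n(\tilde\mu_{q,n}, \mu)}\Big] = 1.$ 

Then (7.3)-(7.5) yield (7.1). 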

The key point is that (7.5) follows exactly as in Section 6. Indeed, there 
is a precise analog of Theorem 6.1 for the relation between the frequentist 
risk and the frequentist ideal risk. This is based on two ideas. 

First, if $G_n$ now denotes the empirical cdf of $X_1, \ldots, X_n$ in the frequentist model, 
we again have very strong convergence properties of $G_n$, this time to $G_\mu$. 
This concerns convergence of the empirical cdf for non-i.i.d. samples, which 
is not well known, but can be found in [16], Chapter 25. 

Lemma 7.1 (Bretagnolle [5]). Let $X_{n1}, X_{n2}, \ldots, X_{nn}$ be independent random 
variables with arbitrary df's $F_{ni}$, let $\mathbb{F}_n(x)$ be the empirical cdf and let 
$\bar F_n = \mathrm{Ave}_i\{F_{ni}\}$. Then for all $n \ge 1$, $s > 0$, there exists an absolute constant 
$c$ such that 

$\mathrm{Prob}\{\sqrt{n}\|\mathbb{F}_n - \bar F_n\| \ge s\} \le 2ec\,e^{-2s^2}.$ 

By means of Massart's work ([16], Chapter 25 and [15]), we can take $c = 1$. 
Then taking $F_{ni} = \mathrm{Exp}(\mu_i)$ and $\bar F_n = G_\mu$, we obtain 

$P_\mu\{\|G_n - G_\mu\| \ge s/\sqrt{n}\} \le 6e^{-2s^2}.$ 

This is completely parallel to the bound (4.12). 

Second, it follows immediately from Section 4's analysis that there are 
frequentist fluctuation bounds for $T_q(G_n) - T_q(G_\mu)$ paralleling those in the 
Bayesian case. To apply this, we define 

(7.6) $M^1_{n,p}(\eta) = \{\mu \in M_{n,p}(\eta), T_q(G_\mu) \le \log(n)/8\}$ 

and 

(7.7) $M^0_{n,p}(\eta) = M_{n,p}(\eta) \setminus M^1_{n,p}(\eta).$ 

Lemma 7.2. For sufficiently small $\eta > 0$, 

1. $\lim_{n \to \infty}[\sup_{\mu \in M^1_{n,p}(\eta)}|R_n(\hat\mu_{q,n}, \mu) - R_n(\tilde\mu_{q,n}, \mu)|] = 0$; 

2. $\lim_{n \to \infty}[\sup_{\mu \in M^0_{n,p}(\eta)}|R_n(\hat\mu_{q,n}, \mu) - R_n(\tilde\mu_{q,n}, \mu)|] = 0$. 

The proof of this lemma is entirely parallel to that of Theorem 6.1, so we 
omit it. This completes the proof of (7.1). 

Fig. 4. Simulation results for FDR thresholding. Curves (dashed, solid, cross, and diamond) 
describe per-coordinate loss of the FDR procedure with different $q$ values ($q$ = 0.05, 
0.15, 0.25, 0.5) at different two-point mixtures. Here, the mixtures concentrate at 1 and $\mu$, with 
mass $\epsilon = \eta/\log(\mu)$ at $\mu$. The horizontal axis is $\mu$. The horizontal line corresponds to the 
asymptotic risk expression $\eta\log\log(\frac{1}{\eta})$. 

8. Discussion. 

8.1. Illustrations. We briefly illustrate two key points. 

First, we consider finite-sample performance of FDR thresholding. Figure 4 
shows the result of FDR thresholding with various values of $q$. It used 
a sample size $n = 10^6$, sparsity parameters $p = 1$, $\eta = 10^{-3}$ and a range of 
two-point mixtures of the kind discussed in Theorem 5.1. The figure compares 
the actual risk of the FDR procedure under a range of situations with 
the asymptotic limit given by Theorem 1.3. Clearly, the risk depends more 
strongly on $q$ in finite samples than seems called for by the asymptotic 
expression in Theorem 1.3. In the simulations, the mixtures were based on 
various $(\epsilon, \mu)$ pairs with $\mu$ ranging between 2 and 30 and where, for each $\mu$, 
$\epsilon = \eta/\log(\mu)$. 

For each $q \in \{0.05, 0.15, 0.25, 0.5\}$, we applied the FDR thresholding 
estimator $\hat\mu_{q,n}$, obtaining an empirical risk measure 

$\hat R(q, \mu) = \hat R(q, \mu; \eta, n) = \frac{1}{n}\|\log\hat\mu_{q,n} - \log\mu\|_2^2.$ 

Fig. 5. Panel (a): The 'bias proxy' $B^2(T_q, G_{\epsilon,\mu})$ and the 'variance proxy' $V(T_q, G_{\epsilon,\mu})$. 
Panel (b): Enlargement of (a). The maxima of $B^2(T_q, G_{\epsilon,\mu})$ and $V(T_q, G_{\epsilon,\mu})$ are obtained 
roughly at $\mu_b^*$ and $\mu_v^*$, respectively, with $\mu_b^* = \log(\frac{1}{\eta})/\log\log\log(\frac{1}{\eta})$, $\mu_v^* = \log(\frac{1}{\eta})\cdot\log\log(\frac{1}{\eta})$. 
In this figure, $\eta = 10^{-6}$. 

Figure 4 plots $\hat R(q, \mu; \eta, n)$ versus $\mu$ for each $q$. As $\mu$ varies between 2 and 
30, the empirical FDR risk first increases to a maximum, then decreases; 
this fits well with our theory. We also note that for $q$ smaller than 1/2, the 
empirical FDR risk is not larger than $\eta\log\log(\frac{1}{\eta})$ and when $q$ is close to 1/2, 
though the empirical FDR risk can be larger than $\eta\log\log(\frac{1}{\eta})$, it is rarely 
larger than, say, $1.3\cdot\eta\log\log(\frac{1}{\eta})$. 

Second, we illustrate the behavior of the ideal risk function introduced 
in the second part of Theorem 5.1. Figure 5 illustrates an example of the 
ideal risk decomposition into bias proxy and variance proxy, showing the 
maxima of each and the different ranges over which the two assume their 
large values. 

8.2. Generalizations. The approach described here can be directly extended 
to other settings. Jin has recently derived, by similar methods, asymptotic 
minimaxity of FDR thresholding for sparse Poisson means obeying 
$\mu_i \ge 1$, with most $\mu_i = 1$. This could be useful in situations where we have 
a collection of 'cells' and expect one event per cell in typical cases, with 
occasional 'hot spots' containing more than one event per cell. 

Preliminary calculations show that a wide range of non-Gaussian additive 
noises can also be handled by these methods. To see why, note that due to 
the use of $\log(\mu_i)$ in both the loss measure and parameter set, the results 
of this paper can be considered a study of FDR thresholding in a situation 
with additive noise having a standard Gumbel distribution. Thus, defining 
$Y_i = \log(X_i)$, the model of Section 1 posits effectively that 

$Y_i = \theta_i + Z_i$, $i = 1, \ldots, n$, 

where $\theta_i \ge 0$ and $\frac{1}{n}\sum_i\theta_i^p \le \eta^p$, 
we measure loss by $\sum_i(\hat\theta_i - \theta_i)^2$ and the noise $Z_i$ obeys $e^{Z_i} \sim \mathrm{Exp}(1)$. 
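
The additive-noise reading of the model can be checked by simulation: if $X \sim \mathrm{Exp}(\mu)$, then $\log X = \log\mu + Z$ with $e^Z \sim \mathrm{Exp}(1)$, and $Z$ has mean $-\gamma$, where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant. The sketch below is illustrative only; the function names, seed and sample size are arbitrary choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def draw_exp1(rng):
    """Draw a strictly positive Exp(1) variate (guards against an exact zero)."""
    x = 0.0
    while x <= 0.0:
        x = rng.expovariate(1.0)
    return x

def log_scale_observation(mu, rng):
    """Return (Y, Z) with Y = log(X), X = mu * Exp(1), and Z = Y - log(mu)."""
    z = math.log(draw_exp1(rng))
    return math.log(mu) + z, z

if __name__ == "__main__":
    rng = random.Random(1)
    zs = [log_scale_observation(3.0, rng)[1] for _ in range(200000)]
    print(sum(zs) / len(zs), -EULER_GAMMA)
```

The noise distribution does not depend on $\mu$, which is why the exponential scale model becomes a location model on the log scale.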

Although we have focused on the one-sided problem in which #j > for 
all i, we can certainly generalize the study to treat the two-sided problem 
with \ &i\ p ) < ff an d where both 9i> and 9{ < are possible. Other 

additive non-Gaussian noises which have been considered include double- 
exponential. Of course, in considering non-Gaussian distributions, the effec- 
tiveness of thresholding depends on the tails of the noise distribution being 
sufficiently light. Thus, asymptotic minimaxity of thresholding would be 
doubtful for additive Cauchy noise. 

Another generalization concerns dependent settings. In principle, FDR 
thresholding can still be 'estimating' the FDR functional in large samples, 
even without i.i.d. stochastic disturbances. Suppose that the Xi are weakly 
dependent, in such a way that their empirical cdf still converges at a root-n 
rate. Then all of the above analysis can be carried through in detail without 
essential change. 

One frequently raised question is whether the study here could easily be 
generalized to other distributional settings such as other exponential 
families. Unfortunately, the results in this paper depend on some properties 
of the exponential distribution which other exponential families might not 
have. The most important is the monotone likelihood ratio (MLR) property of 
the family of exponential density functions 
$\{f_\mu(x),\ 0 < \mu < \infty : f_\mu(x) = \frac{1}{\mu} e^{-x/\mu} \cdot 1_{\{x > 0\}}\}$ 
[14]; this seems crucial for our argument [12], but some exponential families 
are not MLR. Jin's study shows that the behavior of the FDR functional in 
the discrete Poisson setting is essentially different from that of a continuous 
setting (Gaussian, exponential, etc.). Another frequently raised issue con- 
cerns the possibility of working on the original scale instead of the log-scale. 
However, this does not give rise to a meaningful problem; if we used $\ell^2$-loss 
on $\mu_i$ instead of on $\log \mu_i$, the minimax risk would be infinite. 
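
For reference, the monotone likelihood ratio property of the exponential family mentioned above can be verified in one line. Writing $f_\mu(x) = \frac{1}{\mu} e^{-x/\mu}$ for $x > 0$ and taking $0 < \mu < \mu' < \infty$:

```latex
\[
\frac{f_{\mu'}(x)}{f_{\mu}(x)}
  = \frac{\mu}{\mu'} \exp\!\left( x \left( \frac{1}{\mu} - \frac{1}{\mu'} \right) \right),
  \qquad x > 0 .
\]
```

Since $\frac{1}{\mu} - \frac{1}{\mu'} > 0$, the ratio is strictly increasing in $x$, so larger observations always favor the larger mean.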

8.3. Relation to other work. There are two points of contact with earlier 
literature. The first, of course, is with the work of Abramovich, Benjamini, 
Donoho and Johnstone [3]. Like the present work, [3] proves an asymptotic 
minimaxity property for the FDR thresholding estimator, but only for Gaussian 
noise, and for a subtly different notion of sparsity. In [3], the sparsity 
parameter $\eta = \eta_n$, so that the sparsity is linked to sample size, which makes sense 
in a variety of nonparametric estimation applications such as wavelet 
denoising [1, 2, 7, 8]. In our work, $\eta$ goes to zero only after $n \to \infty$. This 
simplifies our analysis; the underlying tools in [3] (empirical processes, 
moderate deviations) are more delicate to deploy than ours. The advantage of 
our approach seems to lie principally in the ease of generalization to a wider 
range of non-Gaussian and dependent situations. 

The second connection is with the work of Genovese and Wasserman [10]. 
While they do not consider our multiparameter estimation problem, they do 
use a Bayesian viewpoint related to Sections 4-6 of our paper. Our approach 
considers, of course, a different class of Bayesian examples and a different 
notion of estimation risk. Their paper seems focused on developing intuition 
and a broader understanding of the FDR approach, while ours uses FDR to 
attack a specific optimal estimation problem. 